Abstract: A method is proposed, called channel polarization,
to construct code sequences that achieve the symmetric capacity
$I(W)$ of any given binary-input discrete memoryless channel
(B-DMC) $W$. The symmetric capacity is the highest rate achievable
subject to using the input letters of the channel with equal
probability. Channel polarization refers to the fact that it is
possible to synthesize, out of $N$ independent copies of a given
B-DMC $W$, a second set of $N$ binary-input channels
$\{W_N^{(i)} : 1 \le i \le N\}$ such that, as $N$ becomes large, the
fraction of indices $i$ for which $I(W_N^{(i)})$ is near 1 approaches
$I(W)$ and the fraction for which $I(W_N^{(i)})$ is near 0 approaches
$1 - I(W)$. The polarized channels $\{W_N^{(i)}\}$ are well-conditioned
for channel coding: one need only send data at rate 1 through those
with capacity near 1 and at rate 0 through the remaining. Codes
constructed on the basis of this idea are called polar codes. The
paper proves that, given any B-DMC $W$ with $I(W) > 0$ and any
target rate $R < I(W)$, there exists a sequence of polar codes
$\{\mathcal{C}_n; n \ge 1\}$ such that $\mathcal{C}_n$ has block-length
$N = 2^n$, rate $\ge R$, and probability of block error under
successive cancellation decoding bounded as
$P_e(N, R) \le O(N^{-1/4})$ independently of the code rate.
This performance is achievable by encoders and decoders with
complexity $O(N \log N)$ for each.

Index Terms: Capacity-achieving codes, channel capacity,
channel polarization, Plotkin construction, polar codes,
Reed-Muller codes, successive cancellation decoding.



I. Introduction and overview 

A fascinating aspect of Shannon's proof of the noisy channel 
coding theorem is the random-coding method that he used 
to show the existence of capacity-achieving code sequences 
without exhibiting any specific such sequence [1]. Explicit 
construction of provably capacity-achieving code sequences 
with low encoding and decoding complexities has since then 
been an elusive goal. This paper is an attempt to meet this 
goal for the class of B-DMCs. 

We will give a description of the main ideas and results of 
the paper in this section. First, we give some definitions and 
state some basic facts that are used throughout the paper. 

A. Preliminaries 

We write $W : \mathcal{X} \to \mathcal{Y}$ to denote a generic B-DMC
with input alphabet $\mathcal{X}$, output alphabet $\mathcal{Y}$, and
transition probabilities $W(y|x)$, $x \in \mathcal{X}$, $y \in \mathcal{Y}$.
The input alphabet $\mathcal{X}$ will always be $\{0, 1\}$; the output
alphabet and the transition probabilities may be arbitrary. We write
$W^N$ to denote the channel corresponding to $N$ uses of $W$; thus,
$W^N : \mathcal{X}^N \to \mathcal{Y}^N$ with
$W^N(y_1^N | x_1^N) = \prod_{i=1}^{N} W(y_i | x_i)$.

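Given a B-DMC $W$, there are two channel parameters of
primary interest in this paper: the symmetric capacity

$$I(W) \triangleq \sum_{y \in \mathcal{Y}} \sum_{x \in \mathcal{X}} \frac{1}{2} W(y|x) \log \frac{W(y|x)}{\frac{1}{2} W(y|0) + \frac{1}{2} W(y|1)}$$

and the Bhattacharyya parameter

$$Z(W) \triangleq \sum_{y \in \mathcal{Y}} \sqrt{W(y|0)\, W(y|1)}.$$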

These parameters are used as measures of rate and reliability,
respectively. $I(W)$ is the highest rate at which reliable
communication is possible across $W$ using the inputs of $W$ with
equal frequency. $Z(W)$ is an upper bound on the probability
of maximum-likelihood (ML) decision error when $W$ is used
only once to transmit a 0 or 1.

It is easy to see that $Z(W)$ takes values in $[0, 1]$. Throughout,
we will use base-2 logarithms; hence, $I(W)$ will also take
values in $[0, 1]$. The unit for code rates and channel capacities
will be bits.

Intuitively, one would expect that $I(W) \approx 1$ iff $Z(W) \approx 0$,
and $I(W) \approx 0$ iff $Z(W) \approx 1$. The following bounds, proved
in the Appendix, make this precise.

Proposition 1: For any B-DMC $W$, we have

$$I(W) \ge \log \frac{2}{1 + Z(W)}, \tag{1}$$

$$I(W) \le \sqrt{1 - Z(W)^2}. \tag{2}$$



The symmetric capacity $I(W)$ equals the Shannon capacity
when $W$ is a symmetric channel, i.e., a channel for which
there exists a permutation $\pi$ of the output alphabet $\mathcal{Y}$ such
that (i) $\pi^{-1} = \pi$ and (ii) $W(y|1) = W(\pi(y)|0)$ for all $y \in \mathcal{Y}$.
The binary symmetric channel (BSC) and the binary erasure
channel (BEC) are examples of symmetric channels. A BSC
is a B-DMC $W$ with $\mathcal{Y} = \{0, 1\}$, $W(0|0) = W(1|1)$, and
$W(1|0) = W(0|1)$. A B-DMC $W$ is called a BEC if for each
$y \in \mathcal{Y}$, either $W(y|0) W(y|1) = 0$ or $W(y|0) = W(y|1)$. In
the latter case, $y$ is said to be an erasure symbol. The sum
of $W(y|0)$ over all erasure symbols $y$ is called the erasure
probability of the BEC.
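
As a numerical illustration of the two parameters and of the bounds (1) and (2), the following minimal sketch evaluates $I(W)$ and $Z(W)$ for a BSC and a BEC. The list-of-pairs channel representation, the function names, and the chosen crossover and erasure probabilities are assumptions made for this example only; they are not part of the paper.

```python
import math

def symmetric_capacity(W):
    """I(W) in bits; W is a list of (W(y|0), W(y|1)) pairs, one per output y."""
    total = 0.0
    for w0, w1 in W:
        q = 0.5 * w0 + 0.5 * w1  # output probability under uniform inputs
        for w in (w0, w1):
            if w > 0:  # terms with W(y|x) = 0 contribute nothing
                total += 0.5 * w * math.log2(w / q)
    return total

def bhattacharyya(W):
    """Z(W) = sum over y of sqrt(W(y|0) * W(y|1))."""
    return sum(math.sqrt(w0 * w1) for w0, w1 in W)

def bsc(p):
    """Binary symmetric channel with crossover probability p (outputs 0, 1)."""
    return [(1 - p, p), (p, 1 - p)]

def bec(eps):
    """Binary erasure channel with erasure probability eps (outputs 0, 1, erasure)."""
    return [(1 - eps, 0.0), (0.0, 1 - eps), (eps, eps)]

for name, W in (("BSC(0.11)", bsc(0.11)), ("BEC(0.5)", bec(0.5))):
    I, Z = symmetric_capacity(W), bhattacharyya(W)
    lower = math.log2(2 / (1 + Z))   # right-hand side of (1)
    upper = math.sqrt(1 - Z * Z)     # right-hand side of (2)
    print(f"{name}: {lower:.4f} <= I = {I:.4f} <= {upper:.4f}, Z = {Z:.4f}")
```

For the BEC the sketch returns $I(W) = 1 - \epsilon$ and $Z(W) = \epsilon$, which matches the role of the erasure probability in the definitions above.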



We denote random variables (RVs) by upper-case letters,
such as $X$, $Y$, and their realizations (sample values) by the
corresponding lower-case letters, such as $x$, $y$. For $X$ a RV, $P_X$
denotes the probability assignment on $X$. For a joint ensemble
of RVs $(X, Y)$, $P_{X,Y}$ denotes the joint probability assignment.
We use the standard notation $I(X; Y)$, $I(X; Y | Z)$ to denote
the mutual information and its conditional form, respectively.

We use the notation $a_1^N$ as shorthand for denoting a row
vector $(a_1, \ldots, a_N)$. Given such a vector $a_1^N$, we write $a_i^j$,
$1 \le i, j \le N$, to denote the subvector $(a_i, \ldots, a_j)$; if $j < i$,
$a_i^j$ is regarded as void. Given $a_1^N$ and $\mathcal{A} \subset \{1, \ldots, N\}$,
we write $a_{\mathcal{A}}$ to denote the subvector $(a_i : i \in \mathcal{A})$. We write
$a_{1,o}^j$ to denote the subvector with odd indices
$(a_k : 1 \le k \le j;\ k \text{ odd})$. We write $a_{1,e}^j$ to denote the
subvector with even indices $(a_k : 1 \le k \le j;\ k \text{ even})$. For
example, for $a_1^5 = (5, 4, 6, 2, 1)$, we have $a_2^4 = (4, 6, 2)$,
$a_{1,e}^5 = (4, 2)$, $a_{1,o}^4 = (5, 6)$. The notation $0_1^N$ is
used to denote the all-zero vector.

Code constructions in this paper will be carried out in vector
spaces over the binary field GF(2). Unless specified otherwise,
all vectors, matrices, and operations on them will be over
GF(2). In particular, for $a_1^N$, $b_1^N$ vectors over GF(2), we write
$a_1^N \oplus b_1^N$ to denote their componentwise mod-2 sum. The
Kronecker product of an $m$-by-$n$ matrix $A = [A_{ij}]$ and an
$r$-by-$s$ matrix $B = [B_{ij}]$ is defined as

$$A \otimes B = \begin{bmatrix} A_{11} B & \cdots & A_{1n} B \\ \vdots & \ddots & \vdots \\ A_{m1} B & \cdots & A_{mn} B \end{bmatrix},$$

which is an $mr$-by-$ns$ matrix. The Kronecker power $A^{\otimes n}$ is
defined as $A \otimes A^{\otimes (n-1)}$ for all $n \ge 1$. We will follow the
convention that $A^{\otimes 0} \triangleq [1]$.

We write $|\mathcal{A}|$ to denote the number of elements in a set $\mathcal{A}$.
We write $1_{\mathcal{A}}$ to denote the indicator function of a set $\mathcal{A}$; thus,
$1_{\mathcal{A}}(x)$ equals 1 if $x \in \mathcal{A}$ and 0 otherwise.

We use the standard Landau notation $O(N)$, $o(N)$, $\omega(N)$
to denote the asymptotic behavior of functions.

B. Channel polarization

Channel polarization is an operation by which one manufactures
out of $N$ independent copies of a given B-DMC $W$
a second set of $N$ channels $\{W_N^{(i)} : 1 \le i \le N\}$ that show
a polarization effect in the sense that, as $N$ becomes large,
the symmetric capacity terms $\{I(W_N^{(i)})\}$ tend towards 0 or 1
for all but a vanishing fraction of indices $i$. This operation
consists of a channel combining phase and a channel splitting
phase.

1) Channel combining: This phase combines copies of a
given B-DMC $W$ in a recursive manner to produce a vector
channel $W_N : \mathcal{X}^N \to \mathcal{Y}^N$, where $N$ can be any power of two,
$N = 2^n$, $n \ge 0$. The recursion begins at the 0-th level ($n = 0$)
with only one copy of $W$ and we set $W_1 \triangleq W$. The first level
($n = 1$) of the recursion combines two independent copies of
$W_1$ as shown in Fig. 1 and obtains the channel
$W_2 : \mathcal{X}^2 \to \mathcal{Y}^2$ with the transition probabilities

$$W_2(y_1, y_2 | u_1, u_2) = W(y_1 | u_1 \oplus u_2)\, W(y_2 | u_2). \tag{3}$$

Fig. 1. The channel $W_2$.

The next level of the recursion is shown in Fig. 2 where two
independent copies of $W_2$ are combined to create the channel
$W_4 : \mathcal{X}^4 \to \mathcal{Y}^4$ with transition probabilities
$W_4(y_1^4 | u_1^4) = W_2(y_1^2 | u_1 \oplus u_2, u_3 \oplus u_4)\, W_2(y_3^4 | u_2, u_4)$.











Fig. 2. The channel $W_4$ and its relation to $W_2$ and $W$.



In Fig. 2, $R_4$ is the permutation operation that maps an input
$(s_1, s_2, s_3, s_4)$ to $v_1^4 = (s_1, s_3, s_2, s_4)$. The mapping
$u_1^4 \mapsto x_1^4$ from the input of $W_4$ to the input of $W^4$ can be
written as $x_1^4 = u_1^4 G_4$ with

$$G_4 = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 1 \end{bmatrix}.$$

Thus, we have the relation $W_4(y_1^4 | u_1^4) = W^4(y_1^4 | u_1^4 G_4)$
between the transition probabilities of $W_4$ and those of $W^4$.

The general form of the recursion is shown in Fig. 3 where
two independent copies of $W_{N/2}$ are combined to produce the
channel $W_N$. The input vector $u_1^N$ to $W_N$ is first transformed
into $s_1^N$ so that $s_{2i-1} = u_{2i-1} \oplus u_{2i}$ and $s_{2i} = u_{2i}$ for
$1 \le i \le N/2$. The operator $R_N$ in the figure is a permutation,
known as the reverse shuffle operation, and acts on its input
$s_1^N$ to produce $v_1^N = (s_1, s_3, \ldots, s_{N-1}, s_2, s_4, \ldots, s_N)$, which
becomes the input to the two copies of $W_{N/2}$ as shown in the
figure.

Fig. 3. Recursive construction of $W_N$ from two copies of $W_{N/2}$.

We observe that the mapping $u_1^N \mapsto v_1^N$ is linear over
GF(2). It follows by induction that the overall mapping
$u_1^N \mapsto x_1^N$, from the input of the synthesized channel $W_N$ to the
input of the underlying raw channels $W^N$, is also linear and may be
represented by a matrix $G_N$ so that $x_1^N = u_1^N G_N$. We call $G_N$
the generator matrix of size $N$. The transition probabilities of
the two channels $W_N$ and $W^N$ are related by

$$W_N(y_1^N | u_1^N) = W^N(y_1^N | u_1^N G_N) \tag{4}$$

for all $y_1^N \in \mathcal{Y}^N$, $u_1^N \in \mathcal{X}^N$. We will show in Sect. VII that
$G_N$ equals $B_N F^{\otimes n}$ for any $N = 2^n$, $n \ge 0$, where $B_N$ is a
permutation matrix known as bit-reversal and
$F \triangleq \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}$. Note
that the channel combining operation is fully specified by the
matrix $F$. Also note that $G_N$ and $F^{\otimes n}$ have the same set of
rows, but in a different (bit-reversed) order; we will discuss
this topic more fully in Sect. VII.
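
The recursive combining step and the closed form $G_N = B_N F^{\otimes n}$ can be compared numerically. The sketch below is an illustration only: it builds $G_N$ row by row by pushing unit vectors through the recursion (XOR of adjacent pairs, reverse shuffle, two half-size transforms) and, separately, forms the bit-reversed Kronecker power. The function names are assumptions made for this example, and the check covers only the small sizes that are actually run.

```python
def combine(u):
    """Map u_1^N to x_1^N by the recursive channel-combining structure."""
    n = len(u)
    if n == 1:
        return list(u)
    s = []
    for i in range(0, n, 2):
        # s_{2i-1} = u_{2i-1} XOR u_{2i},  s_{2i} = u_{2i}
        s += [u[i] ^ u[i + 1], u[i + 1]]
    # reverse shuffle R_N: odd-indexed entries first, then even-indexed ones
    v = s[0::2] + s[1::2]
    return combine(v[: n // 2]) + combine(v[n // 2:])

def generator_matrix(N):
    """Row i of G_N is the image of the i-th unit vector under combine()."""
    rows = []
    for i in range(N):
        e = [0] * N
        e[i] = 1
        rows.append(combine(e))
    return rows

def kron(A, B):
    """Kronecker product of two 0/1 matrices given as lists of rows."""
    return [[a & b for a in ra for b in rb] for ra in A for rb in B]

def kron_power(F, n):
    """F^{(x) n} with the convention F^{(x) 0} = [1]."""
    M = [[1]]
    for _ in range(n):
        M = kron(F, M)
    return M

def bit_reverse(i, n):
    """Reverse the n-bit binary representation of i."""
    r = 0
    for _ in range(n):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def bit_reversed_kron(n):
    """Rows of F^{(x) n} taken in bit-reversed order, i.e. B_N F^{(x) n}."""
    Fn = kron_power([[1, 0], [1, 1]], n)
    return [Fn[bit_reverse(i, n)] for i in range(2 ** n)]

for row in generator_matrix(4):
    print(row)
print(all(generator_matrix(2 ** n) == bit_reversed_kron(n) for n in range(5)))
```

The printed rows for $N = 4$ can be compared against the matrix $G_4$ given in the text.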

2) Channel splitting: Having synthesized the vector channel
$W_N$ out of $W^N$, the next step of channel polarization
is to split $W_N$ back into a set of $N$ binary-input coordinate
channels $W_N^{(i)} : \mathcal{X} \to \mathcal{Y}^N \times \mathcal{X}^{i-1}$, $1 \le i \le N$,
defined by the transition probabilities

$$W_N^{(i)}(y_1^N, u_1^{i-1} | u_i) \triangleq \sum_{u_{i+1}^N \in \mathcal{X}^{N-i}} \frac{1}{2^{N-1}} W_N(y_1^N | u_1^N), \tag{5}$$

where $(y_1^N, u_1^{i-1})$ denotes the output of $W_N^{(i)}$ and $u_i$ its input.
To gain an intuitive understanding of the channels $\{W_N^{(i)}\}$,
consider a genie-aided successive cancellation decoder in
which the $i$th decision element estimates $u_i$ after observing
$y_1^N$ and the past channel inputs $u_1^{i-1}$ (supplied correctly by
the genie regardless of any decision errors at earlier stages).
If $u_1^N$ is a-priori uniform on $\mathcal{X}^N$, then $W_N^{(i)}$ is the effective
channel seen by the $i$th decision element in this scenario.



3) Channel polarization:

Theorem 1: For any B-DMC $W$, the channels $\{W_N^{(i)}\}$
polarize in the sense that, for any fixed $\delta \in (0, 1)$, as $N$
goes to infinity through powers of two, the fraction of indices
$i \in \{1, \ldots, N\}$ for which $I(W_N^{(i)}) \in (1 - \delta, 1]$ goes to $I(W)$
and the fraction for which $I(W_N^{(i)}) \in [0, \delta)$ goes to $1 - I(W)$.

This theorem is proved in Sect. IV.

Fig. 4. Plot of $I(W_N^{(i)})$ vs. $i = 1, \ldots, N = 2^{10}$ for a BEC with $\epsilon = 0.5$.

The polarization effect is illustrated in Fig. 4 for the case
$W$ is a BEC with erasure probability $\epsilon = 0.5$. The numbers
$\{I(W_N^{(i)})\}$ have been computed using the recursive relations

$$I(W_N^{(2i-1)}) = I(W_{N/2}^{(i)})^2, \qquad I(W_N^{(2i)}) = 2 I(W_{N/2}^{(i)}) - I(W_{N/2}^{(i)})^2, \tag{6}$$

with $I(W_1^{(1)}) = 1 - \epsilon$. This recursion is valid only for BECs
and it is proved in Sect. III. No efficient algorithm is known
for calculation of $\{I(W_N^{(i)})\}$ for a general B-DMC $W$.

Figure 4 shows that $I(W_N^{(i)})$ tends to be near 0 for small
$i$ and near 1 for large $i$. However, $I(W_N^{(i)})$ shows an erratic
behavior for an intermediate range of $i$. For general B-DMCs,
determining the subset of indices $i$ for which $I(W_N^{(i)})$ is above
a given threshold is an important computational problem that
will be addressed in Sect. IX.
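
For the BEC, the polarization effect can be reproduced in a few lines from the recursion in which each capacity value $c$ at level $N/2$ yields $c^2$ (index $2i-1$) and $2c - c^2$ (index $2i$) at level $N$. The sketch below is illustrative: the function name, the threshold $\delta = 0.05$, and the block-length exponent are assumptions chosen for this example.

```python
def bec_capacities(eps, n):
    """Return [I(W_N^(1)), ..., I(W_N^(N))] for N = 2**n over a BEC(eps)."""
    caps = [1.0 - eps]  # I(W_1^(1)) = 1 - eps
    for _ in range(n):
        nxt = []
        for c in caps:
            nxt.append(c * c)          # index 2i-1
            nxt.append(2 * c - c * c)  # index 2i
        caps = nxt
    return caps

eps, n, delta = 0.5, 10, 0.05
caps = bec_capacities(eps, n)
N = len(caps)
near_one = sum(c > 1 - delta for c in caps) / N
near_zero = sum(c < delta for c in caps) / N
print(f"N = {N}: fraction near 1 = {near_one:.3f}, near 0 = {near_zero:.3f}")
print(f"average capacity = {sum(caps) / N:.6f}")  # stays at 1 - eps
```

Increasing `n` lets one observe how the two fractions move towards $I(W)$ and $1 - I(W)$, while the average of the list stays at $1 - \epsilon$ at every level.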

4) Rate of polarization: For proving coding theorems, the
speed with which the polarization effect takes hold as a
function of $N$ is important. Our main result in this regard
is given in terms of the parameters

$$Z(W_N^{(i)}) = \sum_{y_1^N \in \mathcal{Y}^N} \sum_{u_1^{i-1} \in \mathcal{X}^{i-1}} \sqrt{W_N^{(i)}(y_1^N, u_1^{i-1} | 0)\, W_N^{(i)}(y_1^N, u_1^{i-1} | 1)}. \tag{7}$$

Theorem 2: For any B-DMC $W$ with $I(W) > 0$, and any
fixed $R < I(W)$, there exists a sequence of sets
$\mathcal{A}_N \subset \{1, \ldots, N\}$, $N \in \{1, 2, \ldots, 2^n, \ldots\}$, such that
$|\mathcal{A}_N| \ge NR$ and $Z(W_N^{(i)}) \le O(N^{-5/4})$ for all $i \in \mathcal{A}_N$.

This theorem is proved in Sect. IV-B.



We stated the polarization result in Theorem 2 in terms of
$\{Z(W_N^{(i)})\}$ rather than $\{I(W_N^{(i)})\}$ because this form is better
suited to the coding results that we will develop. A rate of
polarization result in terms of $\{I(W_N^{(i)})\}$ can be obtained from
Theorem 2 with the help of Prop. 1.



C. Polar coding 

We take advantage of the polarization effect to construct
codes that achieve the symmetric channel capacity $I(W)$ by
a method we call polar coding. The basic idea of polar
coding is to create a coding system where one can access
each coordinate channel $W_N^{(i)}$ individually and send data only
through those for which $Z(W_N^{(i)})$ is near 0.

1) $G_N$-coset codes: We first describe a class of block codes
that contain polar codes (the codes of main interest) as a
special case. The block-lengths $N$ for this class are restricted
to powers of two, $N = 2^n$ for some $n \ge 0$. For a given $N$,
each code in the class is encoded in the same manner, namely,

$$x_1^N = u_1^N G_N \tag{8}$$

where $G_N$ is the generator matrix of order $N$, defined above.
For $\mathcal{A}$ an arbitrary subset of $\{1, \ldots, N\}$, we may write (8) as

$$x_1^N = u_{\mathcal{A}} G_N(\mathcal{A}) \oplus u_{\mathcal{A}^c} G_N(\mathcal{A}^c) \tag{9}$$

where $G_N(\mathcal{A})$ denotes the submatrix of $G_N$ formed by the
rows with indices in $\mathcal{A}$.

If we now fix $\mathcal{A}$ and $u_{\mathcal{A}^c}$, but leave $u_{\mathcal{A}}$ as a free variable,
we obtain a mapping from source blocks $u_{\mathcal{A}}$ to codeword
blocks $x_1^N$. This mapping is a coset code: it is a coset of
the linear block code with generator matrix $G_N(\mathcal{A})$, with the
coset determined by the fixed vector $u_{\mathcal{A}^c} G_N(\mathcal{A}^c)$. We will
refer to this class of codes collectively as $G_N$-coset codes.
Individual $G_N$-coset codes will be identified by a parameter
vector $(N, K, \mathcal{A}, u_{\mathcal{A}^c})$, where $K$ is the code dimension and
specifies the size of $\mathcal{A}$ (see footnote 1). The ratio $K/N$ is called
the code rate. We will refer to $\mathcal{A}$ as the information set and to
$u_{\mathcal{A}^c} \in \mathcal{X}^{N-K}$ as frozen bits or vector.

For example, the $(4, 2, \{2, 4\}, (1, 0))$ code has the encoder
mapping

$$x_1^4 = u_1^4 G_4 = (u_2, u_4) \begin{bmatrix} 1 & 0 & 1 & 0 \\ 1 & 1 & 1 & 1 \end{bmatrix} + (1, 0) \begin{bmatrix} 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 \end{bmatrix}. \tag{10}$$

For a source block $(u_2, u_4) = (1, 1)$, the coded block is
$x_1^4 = (1, 1, 0, 1)$.

Polar codes will be specified shortly by giving a particular
rule for the selection of the information set $\mathcal{A}$.
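
The coset encoding rule can be made concrete with the $N = 4$ example. The following is a minimal sketch, not an efficient encoder: it places the information and frozen bits into $u_1^4$ and adds up (mod 2) the rows of $G_4$ selected by the nonzero entries. The helper name `coset_encode` and its argument layout are assumptions made for this illustration.

```python
G4 = [
    [1, 0, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 1],
]

def coset_encode(G, info_set, u_info, u_frozen):
    """Compute x = u_A G(A) XOR u_{A^c} G(A^c); indices in info_set are 1-based."""
    N = len(G)
    frozen_set = [i for i in range(1, N + 1) if i not in info_set]
    u = [0] * N
    for i, bit in zip(sorted(info_set), u_info):
        u[i - 1] = bit
    for i, bit in zip(frozen_set, u_frozen):
        u[i - 1] = bit
    x = [0] * N
    for i in range(N):
        if u[i]:  # add row i of G over GF(2)
            x = [a ^ b for a, b in zip(x, G[i])]
    return x

# The (4, 2, {2, 4}, (1, 0)) code with source block (u2, u4) = (1, 1)
print(coset_encode(G4, {2, 4}, (1, 1), (1, 0)))  # [1, 1, 0, 1]
```

The printed codeword agrees with the coded block $(1, 1, 0, 1)$ stated in the example.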

2) A successive cancellation decoder: Consider a $G_N$-coset
code with parameter $(N, K, \mathcal{A}, u_{\mathcal{A}^c})$. Let $u_1^N$ be encoded into
a codeword $x_1^N$, let $x_1^N$ be sent over the channel $W^N$, and
let a channel output $y_1^N$ be received. The decoder's task is to
generate an estimate $\hat{u}_1^N$ of $u_1^N$, given knowledge of $\mathcal{A}$, $u_{\mathcal{A}^c}$,
and $y_1^N$. Since the decoder can avoid errors in the frozen part
by setting $\hat{u}_{\mathcal{A}^c} = u_{\mathcal{A}^c}$, the real decoding task is to generate
an estimate $\hat{u}_{\mathcal{A}}$ of $u_{\mathcal{A}}$.

Footnote 1: We include the redundant parameter $K$ in the parameter
set because often we consider an ensemble of codes with $K$ fixed and
$\mathcal{A}$ free.

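The coding results in this paper will be given with respect
to a specific successive cancellation (SC) decoder, unless some
other decoder is mentioned. Given any $(N, K, \mathcal{A}, u_{\mathcal{A}^c})$
$G_N$-coset code, we will use a SC decoder that generates its decision
$\hat{u}_1^N$ by computing

$$\hat{u}_i \triangleq \begin{cases} u_i, & \text{if } i \in \mathcal{A}^c \\ h_i(y_1^N, \hat{u}_1^{i-1}), & \text{if } i \in \mathcal{A} \end{cases} \tag{11}$$

in the order $i$ from 1 to $N$, where $h_i : \mathcal{Y}^N \times \mathcal{X}^{i-1} \to \mathcal{X}$,
$i \in \mathcal{A}$, are decision functions defined as

$$h_i(y_1^N, \hat{u}_1^{i-1}) \triangleq \begin{cases} 0, & \text{if } \dfrac{W_N^{(i)}(y_1^N, \hat{u}_1^{i-1} | 0)}{W_N^{(i)}(y_1^N, \hat{u}_1^{i-1} | 1)} \ge 1 \\ 1, & \text{otherwise} \end{cases} \tag{12}$$

for all $y_1^N \in \mathcal{Y}^N$, $\hat{u}_1^{i-1} \in \mathcal{X}^{i-1}$. We will say that a decoder
block error occurred if $\hat{u}_1^N \ne u_1^N$ or equivalently if $\hat{u}_{\mathcal{A}} \ne u_{\mathcal{A}}$.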

The decision functions $\{h_i\}$ defined above resemble ML
decision functions but are not exactly so, because they treat
the future frozen bits $(u_j : j > i,\ j \in \mathcal{A}^c)$ as RVs, rather than
as known bits. In exchange for this suboptimality, $\{h_i\}$ can be
computed efficiently using recursive formulas, as we will show
in Sect. II. Apart from algorithmic efficiency, the recursive
structure of the decision functions is important because it
renders the performance analysis of the decoder tractable.
Fortunately, the loss in performance due to not using true
ML decision functions happens to be negligible: $I(W)$ is still
achievable.
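
To make the decision rule (11)-(12) concrete, the following sketch implements the SC decoder by brute force for small $N$: each $W_N^{(i)}$ is evaluated directly from its definition (5) by summing over all future inputs, so the cost is exponential in $N$. This is not the efficient recursive decoder the paper develops; it only illustrates what the decision functions compute. The function names, the BSC with crossover 0.1, and the channel table layout `W[x][y]` are assumptions made for this example.

```python
from itertools import product

def polar_transform(u):
    """x_1^N = u_1^N G_N via the recursive combining structure."""
    n = len(u)
    if n == 1:
        return list(u)
    s = []
    for i in range(0, n, 2):
        s += [u[i] ^ u[i + 1], u[i + 1]]
    v = s[0::2] + s[1::2]  # reverse shuffle
    return polar_transform(v[: n // 2]) + polar_transform(v[n // 2:])

def block_prob(W, y, x):
    """W^N(y | x) for a memoryless channel with table W[x_k][y_k]."""
    p = 1.0
    for yk, xk in zip(y, x):
        p *= W[xk][yk]
    return p

def split_channel(W, y, past, ui):
    """W_N^(i)(y, past | ui) by direct summation over the future inputs, as in (5)."""
    N = len(y)
    total = 0.0
    for future in product((0, 1), repeat=N - len(past) - 1):
        u = list(past) + [ui] + list(future)
        total += block_prob(W, y, polar_transform(u))
    return total / 2 ** (N - 1)

def sc_decode(W, y, info_set, frozen):
    """SC decoding; info_set holds 1-based indices, frozen maps index -> bit."""
    u_hat = []
    for i in range(1, len(y) + 1):
        if i in info_set:
            w0 = split_channel(W, y, u_hat, 0)
            w1 = split_channel(W, y, u_hat, 1)
            # w0 >= w1 is the ratio test of (12) written without a division
            u_hat.append(0 if w0 >= w1 else 1)
        else:
            u_hat.append(frozen[i])
    return u_hat

p = 0.1
BSC = [[1 - p, p], [p, 1 - p]]
u = [1, 1, 0, 1]                 # frozen (u1, u3) = (1, 0), data (u2, u4) = (1, 1)
y = polar_transform(u)
y[2] ^= 1                        # one channel error, for illustration
print(sc_decode(BSC, y, {2, 4}, {1: 1, 3: 0}))
```

Note that `split_channel` sums over all future bits, frozen or not, which is exactly the sense in which the decision functions treat future frozen bits as RVs.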

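3) Code performance: The notation $P_e(N, K, \mathcal{A}, u_{\mathcal{A}^c})$ will
denote the probability of block error for a $(N, K, \mathcal{A}, u_{\mathcal{A}^c})$
code, assuming that each data vector $u_{\mathcal{A}} \in \mathcal{X}^K$ is sent
with probability $2^{-K}$ and decoding is done by the above SC
decoder. More precisely,

$$P_e(N, K, \mathcal{A}, u_{\mathcal{A}^c}) \triangleq \sum_{u_{\mathcal{A}} \in \mathcal{X}^K} \frac{1}{2^K} \sum_{y_1^N \in \mathcal{Y}^N :\ \hat{u}_1^N(y_1^N) \ne u_1^N} W_N(y_1^N | u_1^N).$$

The average of $P_e(N, K, \mathcal{A}, u_{\mathcal{A}^c})$ over all choices for $u_{\mathcal{A}^c}$ will
be denoted by $P_e(N, K, \mathcal{A})$:

$$P_e(N, K, \mathcal{A}) \triangleq \sum_{u_{\mathcal{A}^c} \in \mathcal{X}^{N-K}} \frac{1}{2^{N-K}} P_e(N, K, \mathcal{A}, u_{\mathcal{A}^c}).$$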



A key bound on block error probability under SC decoding
is the following.

Proposition 2: For any B-DMC $W$ and any choice of the
parameters $(N, K, \mathcal{A})$,

$$P_e(N, K, \mathcal{A}) \le \sum_{i \in \mathcal{A}} Z(W_N^{(i)}). \tag{13}$$

Hence, for each $(N, K, \mathcal{A})$, there exists a frozen vector $u_{\mathcal{A}^c}$
such that

$$P_e(N, K, \mathcal{A}, u_{\mathcal{A}^c}) \le \sum_{i \in \mathcal{A}} Z(W_N^{(i)}). \tag{14}$$

This is proved in Sect. V-B. This result suggests choosing $\mathcal{A}$
from among all $K$-subsets of $\{1, \ldots, N\}$ so as to minimize
the RHS of (13). This idea leads to the definition of polar
codes.

4) Polar codes: Given a B-DMC $W$, a $G_N$-coset code with
parameter $(N, K, \mathcal{A}, u_{\mathcal{A}^c})$ will be called a polar code for $W$
if the information set $\mathcal{A}$ is chosen as a $K$-element subset of
$\{1, \ldots, N\}$ such that $Z(W_N^{(i)}) \le Z(W_N^{(j)})$ for all $i \in \mathcal{A}$,
$j \in \mathcal{A}^c$.

Polar codes are channel-specific designs: a polar code for
one channel may not be a polar code for another. The main
result of this paper will be to show that polar coding achieves
the symmetric capacity $I(W)$ of any given B-DMC $W$.
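
For a BEC the polar coding rule can be carried out exactly, because the Bhattacharyya parameters of the coordinate channels follow the single-step map $Z \mapsto (2Z - Z^2,\ Z^2)$ established in the paper for erasure channels. The sketch below is a hedged illustration of that special case only: it computes all $Z(W_N^{(i)})$, keeps the $K$ indices with the smallest values, and reports the resulting bound of Proposition 2. The function names and the tie-breaking rule (smaller index first, via a stable sort) are assumptions made for this example.

```python
def bec_bhattacharyya(eps, n):
    """Return [Z(W_N^(1)), ..., Z(W_N^(N))] for N = 2**n over a BEC(eps)."""
    z = [eps]  # Z(W_1^(1)) = eps
    for _ in range(n):
        nxt = []
        for v in z:
            nxt.append(2 * v - v * v)  # index 2i-1
            nxt.append(v * v)          # index 2i
        z = nxt
    return z

def polar_information_set(eps, n, K):
    """Pick the K indices (1-based) with the smallest Z; return (A, sum of Z over A)."""
    z = bec_bhattacharyya(eps, n)
    order = sorted(range(1, len(z) + 1), key=lambda i: z[i - 1])
    A = sorted(order[:K])
    return A, sum(z[i - 1] for i in A)

A, bound = polar_information_set(0.5, 3, 4)
print("information set:", A)        # [4, 6, 7, 8]
print("sum of Z over A:", bound)    # 0.6328125
```

At such a short block-length the bound is far from small; the example only shows the selection mechanics, not the asymptotic behavior.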

An alternative rule for polar code definition would be to
specify $\mathcal{A}$ as a $K$-element subset of $\{1, \ldots, N\}$ such that
$I(W_N^{(i)}) \ge I(W_N^{(j)})$ for all $i \in \mathcal{A}$, $j \in \mathcal{A}^c$. This alternative
rule would also achieve $I(W)$. However, the rule based on
the Bhattacharyya parameters has the advantage of being
connected with an explicit bound on block error probability.

The polar code definition does not specify how the frozen
vector $u_{\mathcal{A}^c}$ is to be chosen; it may be chosen at will. This
degree of freedom in the choice of $u_{\mathcal{A}^c}$ simplifies the
performance analysis of polar codes by allowing averaging over an
ensemble. However, it is not for analytical convenience alone
that we do not specify a precise rule for selecting $u_{\mathcal{A}^c}$, but
also because it appears that the code performance is relatively
insensitive to that choice. In fact, we prove in Sect. VI-B that,
for symmetric channels, any choice for $u_{\mathcal{A}^c}$ is as good as any
other.

5) Coding theorems: Fix a B-DMC $W$ and a number
$R \ge 0$. Let $P_e(N, R)$ be defined as $P_e(N, \lfloor NR \rfloor, \mathcal{A})$ with
$\mathcal{A}$ selected in accordance with the polar coding rule for $W$.
Thus, $P_e(N, R)$ is the probability of block error under SC
decoding for polar coding over $W$ with block-length $N$ and
rate $R$, averaged over all choices for the frozen bits $u_{\mathcal{A}^c}$. The
main coding result of this paper is the following:

Theorem 3: For any given B-DMC $W$ and fixed
$R < I(W)$, block error probability for polar coding under
successive cancellation decoding satisfies

$$P_e(N, R) = O(N^{-1/4}). \tag{15}$$

This theorem follows as an easy corollary to Theorem 2 and
the bound (13), as we show in Sect. V-B. For symmetric
channels, we have the following stronger version of Theorem 3.

Theorem 4: For any symmetric B-DMC $W$ and any fixed
$R < I(W)$, consider any sequence of $G_N$-coset codes
$(N, K, \mathcal{A}, u_{\mathcal{A}^c})$ with $N$ increasing to infinity, $K = \lfloor NR \rfloor$,
$\mathcal{A}$ chosen in accordance with the polar coding rule for $W$,
and $u_{\mathcal{A}^c}$ fixed arbitrarily. The block error probability under
successive cancellation decoding satisfies

$$P_e(N, K, \mathcal{A}, u_{\mathcal{A}^c}) = O(N^{-1/4}). \tag{16}$$

This is proved in Sect. VI-B. Note that for symmetric
channels $I(W)$ equals the Shannon capacity of $W$.
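
The step from Theorem 2 and the bound (13) to the $O(N^{-1/4})$ rate of Theorem 3 can be sketched informally as follows; this is an outline under the stated assumptions, not the proof given in the paper. Let $\mathcal{A}_N$ be a set as in Theorem 2 and let $\mathcal{A}$ be the information set picked by the polar coding rule with $K = \lfloor NR \rfloor \le |\mathcal{A}_N|$. Because the polar rule selects the $K$ smallest Bhattacharyya parameters, the sum over $\mathcal{A}$ is no larger than the sum over any $K$ elements of $\mathcal{A}_N$, so

$$P_e(N, R) \le \sum_{i \in \mathcal{A}} Z(W_N^{(i)}) \le K \max_{i \in \mathcal{A}_N} Z(W_N^{(i)}) \le N \cdot O(N^{-5/4}) = O(N^{-1/4}).$$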

6) Complexity: An important issue about polar coding is
the complexity of encoding, decoding, and code construction.
The recursive structure of the channel polarization construction
leads to low-complexity encoding and decoding algorithms for
the class of $G_N$-coset codes, and in particular, for polar codes.

Theorem 5: For the class of $G_N$-coset codes, the complexity
of encoding and the complexity of successive cancellation
decoding are both $O(N \log N)$ as functions of code
block-length $N$.

This theorem is proved in Sections VII and VIII. Notice
that the complexity bounds in Theorem 5 are independent of
the code rate and the way the frozen vector is chosen. The
bounds hold even at rates above $I(W)$, but clearly this has no
practical significance.

As for code construction, we have found no low-complexity
algorithms for constructing polar codes. One exception is the
case of a BEC for which we have a polar code construction
algorithm with complexity $O(N)$. We discuss the code
construction problem further in Sect. IX and suggest a
low-complexity statistical algorithm for approximating the exact
polar code construction.

D. Relations to previous work 

This paper is an extension of work begun in [2], where 
channel combining and splitting were used to show that 
improvements can be obtained in the sum cutoff rate for some 
specific DMCs. However, no recursive method was suggested 
there to reach the ultimate limit of such improvements. 

As the present work progressed, it became clear that polar 
coding had much in common with Reed-Muller (RM) coding 
[3], [4]. Indeed, recursive code construction and SC decoding, 
which are two essential ingredients of polar coding, appear to 
have been introduced into coding theory by RM codes. 

According to one construction of RM codes, for any $N = 2^n$,
$n \ge 0$, and $0 \le K \le N$, an RM code with block-length
$N$ and dimension $K$, denoted RM$(N, K)$, is defined as
a linear code whose generator matrix $G_{RM}(N, K)$ is obtained
by deleting $(N - K)$ of the rows of $F^{\otimes n}$ so that none of the
deleted rows has a larger Hamming weight (number of 1s in
that row) than any of the remaining $K$ rows. For instance,

$$G_{RM}(4, 4) = F^{\otimes 2} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 1 & 1 \end{bmatrix} \quad \text{and} \quad G_{RM}(4, 2) = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 1 & 1 & 1 & 1 \end{bmatrix}.$$

This construction brings out the similarities between RM
codes and polar codes. Since $G_N$ and $F^{\otimes n}$ have the same
set of rows (only in a different order) for any $N = 2^n$, it is
clear that RM codes belong to the class of $G_N$-coset codes.
For example, RM$(4, 2)$ is the $G_4$-coset code with parameter
$(4, 2, \{2, 4\}, (0, 0))$. So, RM coding and polar coding may be
regarded as two alternative rules for selecting the information
set $\mathcal{A}$ of a $G_N$-coset code of a given size $(N, K)$. Unlike polar
coding, RM coding selects the information set in a
channel-independent manner; it is not as fine-tuned to the channel
polarization phenomenon as polar coding is. We will show
in Sect. X that, at least for the class of BECs, the RM rule
for information set selection leads to asymptotically unreliable
codes under SC decoding. So, polar coding goes beyond RM
coding in a non-trivial manner by paying closer attention to
channel polarization.
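
The RM rule for information set selection can be sketched by ranking the rows of $G_N$ by Hamming weight. The example below is illustrative only: the function names are hypothetical, and the tie-breaking among rows of equal weight (smaller index kept first) is an assumption of this sketch, since the definition above leaves such ties open.

```python
def polar_transform(u):
    """x_1^N = u_1^N G_N via the recursive combining structure."""
    n = len(u)
    if n == 1:
        return list(u)
    s = []
    for i in range(0, n, 2):
        s += [u[i] ^ u[i + 1], u[i + 1]]
    v = s[0::2] + s[1::2]  # reverse shuffle
    return polar_transform(v[: n // 2]) + polar_transform(v[n // 2:])

def row_weights(N):
    """Hamming weight of each row of G_N (row i is the image of unit vector i)."""
    weights = []
    for i in range(N):
        e = [0] * N
        e[i] = 1
        weights.append(sum(polar_transform(e)))
    return weights

def rm_information_set(n, K):
    """Keep the K heaviest rows of G_N; ties go to the smaller index (an assumption)."""
    N = 2 ** n
    w = row_weights(N)
    order = sorted(range(1, N + 1), key=lambda i: (-w[i - 1], i))
    return sorted(order[:K])

print(row_weights(8))            # [1, 2, 2, 4, 2, 4, 4, 8]
print(rm_information_set(2, 2))  # [2, 4], as in the RM(4, 2) example
print(rm_information_set(3, 4))  # [4, 6, 7, 8]
```

Unlike the polar rule, nothing in this selection depends on the channel $W$.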

Another connection to existing work can be established
by noting that polar codes are multi-level $|u|u + v|$ codes,
which are a class of codes originating from Plotkin's method
for code combining [5]. This connection is not surprising in
view of the fact that RM codes are also multi-level $|u|u + v|$
codes [6, pp. 114-125]. However, unlike typical multi-level
code constructions where one begins with specific small codes
to build larger ones, in polar coding the multi-level code is
obtained by expurgating rows of a full-order generator matrix,
$G_N$, with respect to a channel-specific criterion. The special
structure of $G_N$ ensures that, no matter how expurgation is
done, the resulting code is a multi-level $|u|u + v|$ code. In
essence, polar coding enjoys the freedom to pick a multi-level
code from an ensemble of such codes so as to suit the channel
at hand, while conventional approaches to multi-level coding
do not have this degree of flexibility.

Finally, we wish to mention a "spectral" interpretation of
polar codes which is similar to Blahut's treatment of BCH codes
[7, Ch. 9]; this type of similarity has already been pointed out
by Forney [8, Ch. 11] in connection with RM codes. From
the spectral viewpoint, the encoding operation (8) is regarded
as a transform of a "frequency" domain information vector
$u_1^N$ to a "time" domain codeword vector $x_1^N$. The transform
is invertible with $G_N^{-1} = G_N$. The decoding operation is
regarded as a spectral estimation problem in which one is given
a time domain observation $y_1^N$, which is a noisy version of $x_1^N$,
and asked to estimate $u_1^N$. To aid the estimation task, one is
allowed to freeze a certain number of spectral components of
$u_1^N$. This spectral interpretation of polar coding suggests that
it may be possible to treat polar codes and BCH codes in a
unified framework. The spectral interpretation also opens the
door to the use of various signal processing techniques in polar
coding; indeed, in Sect. VII, we exploit some fast transform
techniques in designing encoders for polar codes.



E. Paper outline 

The rest of the paper is organized as follows. Sect. II
explores the recursive properties of the channel splitting
operation. In Sect. III, we focus on how $I(W)$ and $Z(W)$ get
transformed through a single step of channel combining and
splitting. We extend this to an asymptotic analysis in Sect. IV
and complete the proofs of Theorem 1 and Theorem 2. This
completes the part of the paper on channel polarization; the
rest of the paper is mainly about polar coding. Section V
develops an upper bound on the block error probability of polar
coding under SC decoding and proves Theorem 3. Sect. VI
considers polar coding for symmetric B-DMCs and proves
Theorem 4. Sect. VII gives an analysis of the encoder mapping
$G_N$, which results in efficient encoder implementations. In
Sect. VIII, we give an implementation of SC decoding with
complexity $O(N \log N)$. In Sect. IX, we discuss the code
construction complexity and propose an $O(N \log N)$ statistical
algorithm for approximate code construction. In Sect. X, we
explain why RM codes have a poor asymptotic performance
under SC decoding. In Sect. XI, we point out some
generalizations of the present work, give some complementary remarks,
and state some open problems.

II. Recursive channel transformations 

We have defined a blockwise channel combining and splitting
operation by (4) and (5) which transformed $N$ independent
copies of $W$ into $W_N^{(1)}, \ldots, W_N^{(N)}$. The goal in this section
is to show that this blockwise channel transformation can
be broken recursively into single-step channel transformations.

We say that a pair of binary-input channels $W' : \mathcal{X} \to \tilde{\mathcal{Y}}$
and $W'' : \mathcal{X} \to \tilde{\mathcal{Y}} \times \mathcal{X}$ are obtained by a single-step
transformation of two independent copies of a binary-input
channel $W : \mathcal{X} \to \mathcal{Y}$ and write

$$(W, W) \mapsto (W', W'')$$

iff there exists a one-to-one mapping $f : \mathcal{Y}^2 \to \tilde{\mathcal{Y}}$ such that

$$W'(f(y_1, y_2) | u_1) = \sum_{u_2'} \frac{1}{2} W(y_1 | u_1 \oplus u_2')\, W(y_2 | u_2'), \tag{17}$$

$$W''(f(y_1, y_2), u_1 | u_2) = \frac{1}{2} W(y_1 | u_1 \oplus u_2)\, W(y_2 | u_2) \tag{18}$$

for all $u_1, u_2 \in \mathcal{X}$, $y_1, y_2 \in \mathcal{Y}$.

According to this, we can write $(W, W) \mapsto (W_2^{(1)}, W_2^{(2)})$
for any given B-DMC $W$ because

$$W_2^{(1)}(y_1^2 | u_1) \triangleq \sum_{u_2} \frac{1}{2} W_2(y_1^2 | u_1^2) = \sum_{u_2} \frac{1}{2} W(y_1 | u_1 \oplus u_2)\, W(y_2 | u_2), \tag{19}$$

$$W_2^{(2)}(y_1^2, u_1 | u_2) \triangleq \frac{1}{2} W_2(y_1^2 | u_1^2) = \frac{1}{2} W(y_1 | u_1 \oplus u_2)\, W(y_2 | u_2), \tag{20}$$

which are in the form of (17) and (18) by taking $f$ as the
identity mapping.

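It turns out we can write, more generally,

$$(W_N^{(i)}, W_N^{(i)}) \mapsto (W_{2N}^{(2i-1)}, W_{2N}^{(2i)}). \tag{21}$$

This follows as a corollary to the following:

Proposition 3: For any $n \ge 0$, $N = 2^n$, $1 \le i \le N$,

$$W_{2N}^{(2i-1)}(y_1^{2N}, u_1^{2i-2} | u_{2i-1}) = \sum_{u_{2i}} \frac{1}{2} W_N^{(i)}(y_1^N, u_{1,o}^{2i-2} \oplus u_{1,e}^{2i-2} | u_{2i-1} \oplus u_{2i}) \cdot W_N^{(i)}(y_{N+1}^{2N}, u_{1,e}^{2i-2} | u_{2i}) \tag{22}$$

and

$$W_{2N}^{(2i)}(y_1^{2N}, u_1^{2i-1} | u_{2i}) = \frac{1}{2} W_N^{(i)}(y_1^N, u_{1,o}^{2i-2} \oplus u_{1,e}^{2i-2} | u_{2i-1} \oplus u_{2i}) \cdot W_N^{(i)}(y_{N+1}^{2N}, u_{1,e}^{2i-2} | u_{2i}). \tag{23}$$

This proposition is proved in the Appendix. The transform
relationship (21) can now be justified by noting that (22) and
(23) are identical in form to (17) and (18), respectively, after
the following substitutions:

$$W \leftarrow W_N^{(i)}, \qquad W' \leftarrow W_{2N}^{(2i-1)}, \qquad W'' \leftarrow W_{2N}^{(2i)},$$
$$u_1 \leftarrow u_{2i-1}, \qquad u_2 \leftarrow u_{2i},$$
$$y_1 \leftarrow (y_1^N, u_{1,o}^{2i-2} \oplus u_{1,e}^{2i-2}), \qquad y_2 \leftarrow (y_{N+1}^{2N}, u_{1,e}^{2i-2}), \qquad f(y_1, y_2) \leftarrow (y_1^{2N}, u_1^{2i-2}).$$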



Fig. 5. The channel transformation process with $N = 8$ channels.



Thus, we have shown that the blockwise channel transformation
from $W^N$ to $(W_N^{(1)}, \ldots, W_N^{(N)})$ breaks at a local
level into single-step channel transformations of the form
(21). The full set of such transformations form a fabric as
shown in Fig. 5 for $N = 8$. Reading from right to left, the
figure starts with four copies of the transformation
$(W, W) \mapsto (W_2^{(1)}, W_2^{(2)})$ and continues in butterfly patterns, each
representing a channel transformation of the form
$(W_{2^i}^{(j)}, W_{2^i}^{(j)}) \mapsto (W_{2^{i+1}}^{(2j-1)}, W_{2^{i+1}}^{(2j)})$.
The two channels at the right end-points
of the butterflies are always identical and independent. At
the rightmost level there are 8 independent copies of $W$;
at the next level to the left, there are 4 independent copies
of $W_2^{(1)}$ and $W_2^{(2)}$ each; and so on. Each step to the left
doubles the number of channel types, but halves the number
of independent copies.



III. Transformation of rate and reliability 

We now investigate how the rate and reliability parameters,
$I(W_N^{(i)})$ and $Z(W_N^{(i)})$, change through a local (single-step)
transformation (21). By understanding the local behavior, we
will be able to reach conclusions about the overall transformation
from $W^N$ to $(W_N^{(1)}, \ldots, W_N^{(N)})$. Proofs of the results
in this section are given in the Appendix.

A. Local transformation of rate and reliability 

Proposition 4: Suppose $(W, W) \mapsto (W', W'')$ for some set
of binary-input channels. Then,

$$I(W') + I(W'') = 2 I(W), \tag{24}$$

$$I(W') \le I(W'') \tag{25}$$

with equality iff $I(W)$ equals 0 or 1.

The equality (24) indicates that the single-step channel
transform preserves the symmetric capacity. The inequality
(25) together with (24) implies that the symmetric capacity
remains unchanged under a single-step transform,
$I(W') = I(W'') = I(W)$, iff $W$ is either a perfect channel or a
completely noisy one. If $W$ is neither perfect nor completely
noisy, the single-step transform moves the symmetric capacity
away from the center in the sense that $I(W') < I(W) < I(W'')$,
thus helping polarization.

Proposition 5: Suppose $(W, W) \mapsto (W', W'')$ for some set
of binary-input channels. Then,

$$Z(W'') = Z(W)^2, \tag{26}$$

$$Z(W') \le 2 Z(W) - Z(W)^2, \tag{27}$$

$$Z(W') \ge Z(W) \ge Z(W''). \tag{28}$$

Equality holds in (27) iff $W$ is a BEC. We have $Z(W') =
Z(W'')$ iff $Z(W)$ equals 0 or 1, or equivalently, iff $I(W)$
equals 1 or 0.

This result shows that reliability can only improve under a
single-step channel transform in the sense that

$$Z(W') + Z(W'') \le 2 Z(W) \tag{29}$$

with equality iff $W$ is a BEC.

Since the BEC plays a special role w.r.t. extremal behavior
of reliability, it deserves special attention.

Proposition 6: Consider the channel transformation
$(W, W) \mapsto (W', W'')$. If $W$ is a BEC with some erasure
probability $\epsilon$, then the channels $W'$ and $W''$ are BECs with
erasure probabilities $2\epsilon - \epsilon^2$ and $\epsilon^2$, respectively. Conversely,
if $W'$ or $W''$ is a BEC, then $W$ is a BEC.
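
The statements of Propositions 4, 5, and 6 can be checked numerically on small channels by building $W'$ and $W''$ directly from (17) and (18) with $f$ taken as the identity. The sketch below is only a sanity check under assumptions of this example: channels are stored as dictionaries mapping each output to the pair $(W(y|0), W(y|1))$, and the function names and the chosen BSC and BEC parameters are hypothetical.

```python
import math
from itertools import product

def capacity(W):
    """Symmetric capacity I(W) in bits."""
    total = 0.0
    for w0, w1 in W.values():
        q = 0.5 * (w0 + w1)
        for w in (w0, w1):
            if w > 0:
                total += 0.5 * w * math.log2(w / q)
    return total

def bhatt(W):
    """Bhattacharyya parameter Z(W)."""
    return sum(math.sqrt(w0 * w1) for w0, w1 in W.values())

def single_step(W):
    """Return (W', W'') obtained from two copies of W via (17) and (18)."""
    Wp, Wpp = {}, {}
    for y1, y2 in product(W, repeat=2):
        # W'((y1, y2) | u1): average over the unknown u2
        Wp[(y1, y2)] = tuple(
            sum(0.5 * W[y1][u1 ^ u2] * W[y2][u2] for u2 in (0, 1))
            for u1 in (0, 1)
        )
        # W''((y1, y2, u1) | u2): u1 is part of the output
        for u1 in (0, 1):
            Wpp[(y1, y2, u1)] = tuple(
                0.5 * W[y1][u1 ^ u2] * W[y2][u2] for u2 in (0, 1)
            )
    return Wp, Wpp

bsc = {0: (0.9, 0.1), 1: (0.1, 0.9)}
Wp, Wpp = single_step(bsc)
print("I(W') + I(W'') =", capacity(Wp) + capacity(Wpp), " 2 I(W) =", 2 * capacity(bsc))
print("Z(W'') =", bhatt(Wpp), " Z(W)^2 =", bhatt(bsc) ** 2)
print("Z(W') =", bhatt(Wp), " 2Z - Z^2 =", 2 * bhatt(bsc) - bhatt(bsc) ** 2)

eps = 0.3
bec = {0: (1 - eps, 0.0), 1: (0.0, 1 - eps), "e": (eps, eps)}
Bp, Bpp = single_step(bec)
print("BEC: Z(W') =", bhatt(Bp), " Z(W'') =", bhatt(Bpp))
```

For the BSC the inequality (27) is strict in this run, while for the BEC the two printed values coincide with $2\epsilon - \epsilon^2$ and $\epsilon^2$ up to rounding.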

B. Rate and reliability for $W_N^{(i)}$

We now return to the context at the end of Sect. II.

Proposition 7: For any B-DMC $W$, $N = 2^n$, $n \ge 0$,
$1 \le i \le N$, the transformation
$(W_N^{(i)}, W_N^{(i)}) \mapsto (W_{2N}^{(2i-1)}, W_{2N}^{(2i)})$
is rate-preserving and reliability-improving in the sense that

$$I(W_{2N}^{(2i-1)}) + I(W_{2N}^{(2i)}) = 2\, I(W_N^{(i)}), \tag{30}$$

$$Z(W_{2N}^{(2i-1)}) + Z(W_{2N}^{(2i)}) \le 2\, Z(W_N^{(i)}), \tag{31}$$

with equality in (31) iff $W$ is a BEC. Channel splitting moves
the rate and reliability away from the center in the sense that

$$I(W_{2N}^{(2i-1)}) \le I(W_N^{(i)}) \le I(W_{2N}^{(2i)}), \tag{32}$$

$$Z(W_{2N}^{(2i-1)}) \ge Z(W_N^{(i)}) \ge Z(W_{2N}^{(2i)}), \tag{33}$$

with equality in (32) and (33) iff $I(W)$ equals 0 or 1. The
reliability terms further satisfy

$$Z(W_{2N}^{(2i-1)}) \le 2 Z(W_N^{(i)}) - Z(W_N^{(i)})^2, \tag{34}$$

$$Z(W_{2N}^{(2i)}) = Z(W_N^{(i)})^2, \tag{35}$$

with equality in (34) iff $W$ is a BEC. The cumulative rate and
reliability satisfy

$$\sum_{i=1}^{N} I(W_N^{(i)}) = N\, I(W), \tag{36}$$

$$\sum_{i=1}^{N} Z(W_N^{(i)}) \le N\, Z(W), \tag{37}$$

with equality in (37) iff $W$ is a BEC.

This result follows from Prop. 4 and Prop. 5 as a special
case and no separate proof is needed. The cumulative relations
(36) and (37) follow by repeated application of (30) and (31),
respectively. The conditions for equality in Prop. 4 are stated
in terms of $W$ rather than $W_N^{(i)}$; this is possible because: (i)
by Prop. 4, $I(W) \in \{0, 1\}$ iff $I(W_N^{(i)}) \in \{0, 1\}$; and (ii) $W$
is a BEC iff $W_N^{(i)}$ is a BEC, which follows from Prop. 6 by
induction.

For the special case that $W$ is a BEC with an erasure
probability $\epsilon$, it follows from Prop. 4 and Prop. 6 that the
parameters $\{Z(W_N^{(i)})\}$ can be computed through the recursion

$Z(W_N^{(2j-1)}) = 2Z(W_{N/2}^{(j)}) - Z(W_{N/2}^{(j)})^2$,
$Z(W_N^{(2j)}) = Z(W_{N/2}^{(j)})^2$ (38)

with $Z(W_1^{(1)}) = \epsilon$. The parameter $Z(W_N^{(i)})$ equals the erasure
probability of the channel $W_N^{(i)}$. The recursive relations (6)
follow from (38) by the fact that $I(W_N^{(i)}) = 1 - Z(W_N^{(i)})$ for
$W$ a BEC.
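The recursion (38) translates almost line by line into code. The following is a minimal sketch (the function name `bec_bhattacharyya` is our own); it returns the list $Z(W_N^{(1)}), \ldots, Z(W_N^{(N)})$ in natural index order and can be used to confirm the equality case of (37) numerically.

```python
def bec_bhattacharyya(eps, n):
    """[Z(W_N^(1)), ..., Z(W_N^(N))] for N = 2**n and W a BEC(eps), via (38)."""
    z = [eps]                              # Z(W_1^(1)) = eps
    for _ in range(n):
        nxt = []
        for zj in z:
            nxt.append(2 * zj - zj * zj)   # index 2j - 1
            nxt.append(zj * zj)            # index 2j
        z = nxt
    return z


z8 = bec_bhattacharyya(0.5, 3)
print(z8)
print(sum(z8))   # 4.0 = N * eps, the equality case of (37)
```

The cost is proportional to the number of values produced, $1 + 2 + \cdots + N = 2N - 1$ steps in total.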



[Fig. 6 node labels, by level: $W_2^{(1)} = W_0$, $W_2^{(2)} = W_1$; $W_4^{(1)} = W_{00}$, ..., $W_4^{(4)} = W_{11}$; $W_8^{(1)} = W_{000}$, ..., $W_8^{(8)} = W_{111}$.]



IV. Channel polarization 

We prove the main results on channel polarization in this 
section. The analysis is based on the recursive relationships 
depicted in Fig. 5; however, it will be more convenient to
re-sketch Fig. 5 as a binary tree as shown in Fig. 6. The root
node of the tree is associated with the channel $W$. The root
$W$ gives birth to an upper channel $W_2^{(1)}$ and a lower channel
$W_2^{(2)}$, which are associated with the two nodes at level 1. The
channel $W_2^{(1)}$ in turn gives birth to the channels $W_4^{(1)}$ and
$W_4^{(2)}$, and so on. The channel $W_{2^n}^{(i)}$ is located at level $n$ of
the tree at node number $i$ counting from the top.

There is a natural indexing of nodes of the tree in Fig. 6 by
bit sequences. The root node is indexed with the null sequence.
The upper node at level 1 is indexed with 0 and the lower
node with 1. Given a node at level $n$ with index $b_1 b_2 \cdots b_n$,
the upper node emanating from it has the label $b_1 b_2 \cdots b_n 0$
and the lower node $b_1 b_2 \cdots b_n 1$. According to this labeling,
the channel $W_{2^n}^{(i)}$ is situated at the node $b_1 b_2 \cdots b_n$ with
$i = 1 + \sum_{j=1}^{n} b_j 2^{n-j}$. We denote the channel $W_{2^n}^{(i)}$ located at node
$b_1 b_2 \cdots b_n$ alternatively as $W_{b_1 \ldots b_n}$.
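The correspondence between node labels and channel indices can be sketched as follows. The helper names `node_to_index` and `index_to_node` are ours; the code only restates the formula $i = 1 + \sum_{j=1}^{n} b_j 2^{n-j}$ and its inverse.

```python
def node_to_index(bits):
    """Map a node label b_1 ... b_n (sequence of 0/1) to i = 1 + sum_j b_j 2^(n-j)."""
    n = len(bits)
    return 1 + sum(b << (n - j) for j, b in enumerate(bits, start=1))


def index_to_node(i, n):
    """Inverse map: the n-bit label of channel number i, 1 <= i <= 2**n."""
    return [((i - 1) >> (n - j)) & 1 for j in range(1, n + 1)]


print(node_to_index([0, 1, 1]))   # 4, i.e. W_8^(4) = W_011
print(index_to_node(4, 3))        # [0, 1, 1]
```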

We define a random tree process, denoted $\{K_n; n \ge 0\}$,
in connection with Fig. 6. The process begins at the root of
the tree with $K_0 = W$. For any $n \ge 0$, given that $K_n =
W_{b_1 \cdots b_n}$, $K_{n+1}$ equals $W_{b_1 \cdots b_n 0}$ or $W_{b_1 \cdots b_n 1}$ with probability
1/2 each. Thus, the path taken by $\{K_n\}$ through the channel
tree may be thought of as being driven by a sequence of i.i.d.
Bernoulli RVs $\{B_n; n = 1, 2, \ldots\}$ where $B_n$ equals 0 or 1
with equal probability. Given that $B_1, \ldots, B_n$ has taken on
a sample value $b_1, \ldots, b_n$, the random channel process takes
the value $K_n = W_{b_1 \cdots b_n}$. In order to keep track of the rate
and reliability parameters of the random sequence of channels
$K_n$, we define the random processes $I_n = I(K_n)$ and $Z_n =
Z(K_n)$.

Fig. 6. The tree process for the recursive channel construction.

For a more precise formulation of the problem, we consider
the probability space $(\Omega, \mathcal{F}, P)$ where $\Omega$ is the space of all
binary sequences $(b_1, b_2, \ldots) \in \{0, 1\}^{\infty}$, $\mathcal{F}$ is the Borel field
(BF) generated by the cylinder sets $S(b_1, \ldots, b_n) = \{\omega \in
\Omega : \omega_1 = b_1, \ldots, \omega_n = b_n\}$, $n \ge 1$, $b_1, \ldots, b_n \in \{0, 1\}$,
and $P$ is the probability measure defined on $\mathcal{F}$ such that
$P(S(b_1, \ldots, b_n)) = 1/2^n$. For each $n \ge 1$, we define $\mathcal{F}_n$ as
the BF generated by the cylinder sets $S(b_1, \ldots, b_i)$, $1 \le i \le n$,
$b_1, \ldots, b_i \in \{0, 1\}$. We define $\mathcal{F}_0$ as the trivial BF consisting
of the null set and $\Omega$ only. Clearly, $\mathcal{F}_0 \subset \mathcal{F}_1 \subset \cdots \subset \mathcal{F}$.

The random processes described above can now be formally
defined as follows. For $\omega = (\omega_1, \omega_2, \ldots) \in \Omega$ and $n \ge 1$,
define $B_n(\omega) = \omega_n$, $K_n(\omega) = W_{\omega_1 \cdots \omega_n}$, $I_n(\omega) = I(K_n(\omega))$,
and $Z_n(\omega) = Z(K_n(\omega))$. For $n = 0$, define $K_0 = W$, $I_0 =
I(W)$, $Z_0 = Z(W)$. It is clear that, for any fixed $n \ge 0$, the
RVs $B_n$, $K_n$, $I_n$, and $Z_n$ are measurable with respect to the
BF $\mathcal{F}_n$.



A. Proof of Theorem 1

We will prove Theorem 1 by considering the stochastic
convergence properties of the random sequences $\{I_n\}$ and
$\{Z_n\}$.

Proposition 8: The sequence of random variables and Borel
fields $\{I_n, \mathcal{F}_n; n \ge 0\}$ is a martingale, i.e.,

$\mathcal{F}_n \subset \mathcal{F}_{n+1}$ and $I_n$ is $\mathcal{F}_n$-measurable, (39)

$E[|I_n|] < \infty$, (40)

$I_n = E[I_{n+1} \mid \mathcal{F}_n]$. (41)

Furthermore, the sequence $\{I_n; n \ge 0\}$ converges a.e. to a
random variable $I_\infty$ such that $E[I_\infty] = I_0$.

Proof: Condition (39) is true by construction and (40)
by the fact that $0 \le I_n \le 1$. To prove (41), consider a cylinder
set $S(b_1, \ldots, b_n) \in \mathcal{F}_n$ and use Prop. 7 to write

$E[I_{n+1} \mid S(b_1, \ldots, b_n)] = \frac{1}{2} I(W_{b_1 \cdots b_n 0}) + \frac{1}{2} I(W_{b_1 \cdots b_n 1}) = I(W_{b_1 \cdots b_n})$.

Since $I(W_{b_1 \cdots b_n})$ is the value of $I_n$ on $S(b_1, \ldots, b_n)$, (41)
follows. This completes the proof that $\{I_n, \mathcal{F}_n\}$ is a martingale.
Since $\{I_n, \mathcal{F}_n\}$ is a uniformly integrable martingale, by
general convergence results about such martingales (see, e.g.,
[9, Theorem 9.4.6]), the claim about $I_\infty$ follows. ■

It should not be surprising that the limit RV $I_\infty$ takes values
a.e. in $\{0, 1\}$, which is the set of fixed points of $I(W)$ under
the transformation $(W, W) \mapsto (W_2^{(1)}, W_2^{(2)})$, as determined
by the condition for equality in (25). For a rigorous proof
of this statement, we take an indirect approach and bring the
process $\{Z_n; n \ge 0\}$ also into the picture.

Proposition 9: The sequence of random variables and Borel
fields $\{Z_n, \mathcal{F}_n; n \ge 0\}$ is a supermartingale, i.e.,

$\mathcal{F}_n \subset \mathcal{F}_{n+1}$ and $Z_n$ is $\mathcal{F}_n$-measurable, (42)

$E[|Z_n|] < \infty$, (43)

$Z_n \ge E[Z_{n+1} \mid \mathcal{F}_n]$. (44)

Furthermore, the sequence $\{Z_n; n \ge 0\}$ converges a.e. to a
random variable $Z_\infty$ which takes values a.e. in $\{0, 1\}$.

Proof: Conditions (42) and (43) are clearly satisfied. To
verify (44), consider a cylinder set $S(b_1, \ldots, b_n) \in \mathcal{F}_n$ and
use Prop. 7 to write

$E[Z_{n+1} \mid S(b_1, \ldots, b_n)] = \frac{1}{2} Z(W_{b_1 \cdots b_n 0}) + \frac{1}{2} Z(W_{b_1 \cdots b_n 1}) \le Z(W_{b_1 \cdots b_n})$.

Since $Z(W_{b_1 \cdots b_n})$ is the value of $Z_n$ on $S(b_1, \ldots, b_n)$, (44)
follows. This completes the proof that $\{Z_n, \mathcal{F}_n\}$ is a
supermartingale. For the second claim, observe that the
supermartingale $\{Z_n, \mathcal{F}_n\}$ is uniformly integrable; hence, it converges a.e.
and in $\mathcal{L}^1$ to a RV $Z_\infty$ such that $E[|Z_n - Z_\infty|] \to 0$ (see,
e.g., [9, Theorem 9.4.5]). It follows that $E[|Z_{n+1} - Z_n|] \to 0$.
But, by Prop. 7, $Z_{n+1} = Z_n^2$ with probability 1/2; hence,
$E[|Z_{n+1} - Z_n|] \ge (1/2) E[Z_n(1 - Z_n)] \ge 0$. Thus, $E[Z_n(1 -
Z_n)] \to 0$, which implies $E[Z_\infty(1 - Z_\infty)] = 0$. This, in turn,
means that $Z_\infty$ equals 0 or 1 a.e. ■

Proposition 10: The limit RV $I_\infty$ takes values a.e. in the
set $\{0, 1\}$: $P(I_\infty = 1) = I_0$ and $P(I_\infty = 0) = 1 - I_0$.

Proof: The fact that $Z_\infty$ equals 0 or 1 a.e., combined
with Prop. 1, implies that $I_\infty = 1 - Z_\infty$ a.e. Since $E[I_\infty] = I_0$,
the rest of the claim follows. ■
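The polarization effect is easy to observe numerically for a BEC, where $Z(W_N^{(i)})$ is available in closed recursive form and $I = 1 - Z$. The sketch below is ours, not part of the paper; the threshold `delta` and the helper names are illustrative assumptions. It reports the fraction of indices with $Z$ below `delta`, above `1 - delta`, and in between. According to Prop. 10, the first fraction should approach $I_0 = 1 - \epsilon$ and the last one should vanish as $n$ grows.

```python
def bec_z(eps, n):
    """Erasure probabilities Z(W_N^(i)), N = 2**n, for a BEC(eps)."""
    z = [eps]
    for _ in range(n):
        z = [v for zj in z for v in (2 * zj - zj * zj, zj * zj)]
    return z


def fractions(eps, n, delta=0.05):
    """Fractions of indices with Z < delta, Z > 1 - delta, and in between."""
    z = bec_z(eps, n)
    good = sum(v < delta for v in z) / len(z)
    bad = sum(v > 1 - delta for v in z) / len(z)
    return good, bad, 1.0 - good - bad


for n in (0, 4, 8, 12, 16):
    print(n, fractions(0.5, n))
```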

As a corollary to Prop. 10, we can conclude that, as $N$ tends
to infinity, the symmetric capacity terms $\{I(W_N^{(i)}) : 1 \le i \le
N\}$ cluster around 0 and 1, except for a vanishing fraction.
This completes the proof of Theorem 1.

It is interesting that the above discussion gives a new
interpretation to $I_0 = I(W)$ as the probability that the random
process $\{Z_n; n \ge 0\}$ converges to zero. We may use this to
strengthen the lower bound in (1). (This stronger form is given
as a side result and will not be used in the sequel.)

Proposition 11: For any B-DMC $W$, we have $I(W) +
Z(W) \ge 1$ with equality iff $W$ is a BEC.

This result can be interpreted as saying that, among all
B-DMCs $W$, the BEC presents the most favorable rate-reliability
trade-off: it minimizes $Z(W)$ (maximizes reliability) among
all channels with a given symmetric capacity $I(W)$; equivalently,
it minimizes $I(W)$ required to achieve a given level of
reliability $Z(W)$.

Proof: Consider two channels $W$ and $W'$ with $Z(W) =
Z(W') \triangleq z_0$. Suppose that $W'$ is a BEC. Then, $W'$ has
erasure probability $z_0$ and $I(W') = 1 - z_0$. Consider the
random processes $\{Z_n\}$ and $\{Z'_n\}$ corresponding to $W$ and
$W'$, respectively. By the condition for equality in (34), the
process $\{Z_n\}$ is stochastically dominated by $\{Z'_n\}$ in the sense
that $P(Z_n \le z) \ge P(Z'_n \le z)$ for all $n \ge 1$, $0 \le z \le 1$.
Thus, the probability of $\{Z_n\}$ converging to zero is
lower-bounded by the probability that $\{Z'_n\}$ converges to zero, i.e.,
$I(W) \ge I(W')$. This implies $I(W) + Z(W) \ge 1$. ■
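Prop. 11 can be spot-checked on the two standard channel families. The sketch below uses the well-known closed forms $I = 1 - H(p)$, $Z = 2\sqrt{p(1-p)}$ for a BSC with crossover probability $p$, and $I = 1 - \epsilon$, $Z = \epsilon$ for a BEC; these formulas are standard facts rather than results of this section, and the function names are our own.

```python
import math


def binary_entropy(p):
    """Binary entropy H(p) in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def bsc_params(p):
    """Symmetric capacity I(W) and Bhattacharyya parameter Z(W) of a BSC(p)."""
    return 1.0 - binary_entropy(p), 2.0 * math.sqrt(p * (1.0 - p))


def bec_params(eps):
    """Symmetric capacity and Bhattacharyya parameter of a BEC(eps)."""
    return 1.0 - eps, eps


for p in (0.01, 0.11, 0.25, 0.5):
    i, z = bsc_params(p)
    print(p, round(i + z, 4))   # strictly above 1 for 0 < p < 1/2, equal to 1 at p = 1/2
```

For the BSC the sum exceeds 1 whenever the channel is neither perfect nor useless, while for the BEC it equals 1 for every $\epsilon$, in line with the equality condition of the proposition.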

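B. Proof of Theorem 2

We will now prove Theorem 2, which strengthens the
above polarization results by specifying a rate of polarization.
Consider the probability space $(\Omega, \mathcal{F}, P)$. For $\omega \in \Omega$, $i \ge 0$,
by Prop. 7, we have $Z_{i+1}(\omega) = Z_i(\omega)^2$ if $B_{i+1}(\omega) = 1$ and
$Z_{i+1}(\omega) \le 2Z_i(\omega) - Z_i(\omega)^2 \le 2Z_i(\omega)$ if $B_{i+1}(\omega) = 0$. For
$\zeta \ge 0$ and $m \ge 0$, define

$T_m(\zeta) \triangleq \{\omega \in \Omega : Z_i(\omega) \le \zeta \text{ for all } i \ge m\}$.

For $\omega \in T_m(\zeta)$ and $i \ge m$, we have

$\frac{Z_{i+1}(\omega)}{Z_i(\omega)} \le 2$ if $B_{i+1}(\omega) = 0$, and $\frac{Z_{i+1}(\omega)}{Z_i(\omega)} \le \zeta$ if $B_{i+1}(\omega) = 1$,

which implies

$Z_n(\omega) \le \zeta \cdot 2^{n-m} \cdot \prod_{i=m+1}^{n} (\zeta/2)^{B_i(\omega)}$, $\omega \in T_m(\zeta)$, $n > m$.

For $n > m \ge 0$ and $0 < \eta < 1/2$, define

$U_{m,n}(\eta) \triangleq \{\omega \in \Omega : \sum_{i=m+1}^{n} B_i(\omega) > (1/2 - \eta)(n - m)\}$.

Then, we have

$Z_n(\omega) \le \zeta \cdot \left[2^{1/2+\eta}\, \zeta^{1/2-\eta}\right]^{n-m}$, $\omega \in T_m(\zeta) \cap U_{m,n}(\eta)$,

from which, by putting $\zeta_0 \triangleq 2^{-4}$ and $\eta_0 \triangleq 1/20$, we obtain

$Z_n(\omega) \le 2^{-4-5(n-m)/4}$, $\omega \in T_m(\zeta_0) \cap U_{m,n}(\eta_0)$. (45)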
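The choice of constants leading to (45) can be checked by direct arithmetic. Substituting $\zeta_0 = 2^{-4}$ and $\eta_0 = 1/20$ into the factor $2^{1/2+\eta}\zeta^{1/2-\eta}$ gives the per-step decay $2^{-5/4}$; the following worked computation is added here for the reader's convenience and is not part of the original derivation.

```latex
\begin{align*}
2^{1/2+\eta_0}\,\zeta_0^{1/2-\eta_0}
  &= 2^{11/20}\cdot\bigl(2^{-4}\bigr)^{9/20}
   = 2^{11/20-36/20}
   = 2^{-5/4},\\
\zeta_0\,\bigl[2^{-5/4}\bigr]^{\,n-m}
  &= 2^{-4-5(n-m)/4}.
\end{align*}
```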

Now, we show that (45) occurs with sufficiently high
probability. First, we use the following result, which is proved in
the Appendix.

Lemma 1: For any fixed $\zeta > 0$, $\delta > 0$, there exists a finite
integer $m_0(\zeta, \delta)$ such that

$P[T_{m_0}(\zeta)] \ge I_0 - \delta/2$.

Second, we use Chernoff's bound [10, p. 531] to write

$P[U_{m,n}(\eta)] \ge 1 - 2^{-(n-m)[1 - H(1/2 - \eta)]}$ (46)

where $H$ is the binary entropy function. Define $n_0(m, \eta, \delta)$ as
the smallest $n$ such that the RHS of (46) is greater than or
equal to $1 - \delta/2$; it is clear that $n_0(m, \eta, \delta)$ is finite for any
$m \ge 0$, $0 < \eta < 1/2$, and $\delta > 0$. Now, with $m_1 = m_1(\delta) \triangleq
m_0(\zeta_0, \delta)$ and $n_1 = n_1(\delta) \triangleq n_0(m_1, \eta_0, \delta)$, we obtain the
desired bound:

$P[T_{m_1}(\zeta_0) \cap U_{m_1,n}(\eta_0)] \ge I_0 - \delta$, $n \ge n_1$.

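Finally, we tie the above analysis to the claim of Theorem 2.
Define $c \triangleq 2^{-4+5m_1/4}$ and

$V_n \triangleq \{\omega \in \Omega : Z_n(\omega) \le c\, 2^{-5n/4}\}$, $n \ge 0$;

and, note that

$T_{m_1}(\zeta_0) \cap U_{m_1,n}(\eta_0) \subset V_n$, $n \ge n_1$.

So, $P(V_n) \ge I_0 - \delta$ for $n \ge n_1$. On the other hand,

$P(V_n) = \sum_{\omega_1^n \in \mathcal{X}^n} \frac{1}{2^n}\, 1\{Z(W_{\omega_1^n}) \le c\, 2^{-5n/4}\} = \frac{1}{N} |\mathcal{A}_N|$

where $\mathcal{A}_N \triangleq \{i \in \{1, \ldots, N\} : Z(W_N^{(i)}) \le c N^{-5/4}\}$ with
$N = 2^n$. We conclude that $|\mathcal{A}_N| \ge N(I_0 - \delta)$ for $n \ge n_1(\delta)$.
This completes the proof of Theorem 2.

Given Theorem 2, it is an easy exercise to show that polar
coding can achieve rates approaching $I(W)$, as we will show
in the next section. It is clear from the above proof that
Theorem 2 gives only an ad-hoc result on the asymptotic rate
of channel polarization; this result is sufficient for proving a
capacity theorem for polar coding; however, finding the exact
asymptotic rate of polarization remains an important goal for
future research.²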

V. Performance of polar coding 

We show in this section that polar coding can achieve
the symmetric capacity $I(W)$ of any B-DMC $W$. The main
technical task will be to prove Prop. 2. We will carry out the
analysis over the class of $G_N$-coset codes before specializing
the discussion to polar codes. Recall that individual $G_N$-coset
codes are identified by a parameter vector $(N, K, \mathcal{A}, u_{\mathcal{A}^c})$.
In the analysis, we will fix the parameters $(N, K, \mathcal{A})$ while
keeping $u_{\mathcal{A}^c}$ free to take any value over $\mathcal{X}^{N-K}$. In other
words, the analysis will be over the ensemble of $2^{N-K}$
$G_N$-coset codes with a fixed $(N, K, \mathcal{A})$. The decoder in the system
will be the SC decoder described in Sect. I-C.2.

A. A probabilistic setting for the analysis 

Let $(\mathcal{X}^N \times \mathcal{Y}^N, P)$ be a probability space with the
probability assignment

$P(\{(u_1^N, y_1^N)\}) \triangleq 2^{-N}\, W_N(y_1^N \mid u_1^N)$ (47)

for all $(u_1^N, y_1^N) \in \mathcal{X}^N \times \mathcal{Y}^N$. On this probability space, we
define an ensemble of random vectors $(U_1^N, X_1^N, Y_1^N, \hat{U}_1^N)$
that represent, respectively, the input to the synthetic channel
$W_N$, the input to the product-form channel $W^N$, the output of
$W^N$ (and also of $W_N$), and the decisions by the decoder. For
each sample point $(u_1^N, y_1^N) \in \mathcal{X}^N \times \mathcal{Y}^N$, the first three vectors
take on the values $U_1^N(u_1^N, y_1^N) = u_1^N$, $X_1^N(u_1^N, y_1^N) =
u_1^N G_N$, and $Y_1^N(u_1^N, y_1^N) = y_1^N$, while the decoder output
takes on the value $\hat{U}_1^N(u_1^N, y_1^N)$ whose coordinates are defined
recursively as

$\hat{U}_i(u_1^N, y_1^N) = u_i$ if $i \in \mathcal{A}^c$, and $\hat{U}_i(u_1^N, y_1^N) = h_i(y_1^N, \hat{U}_1^{i-1}(u_1^N, y_1^N))$ if $i \in \mathcal{A}$, (48)

for $i = 1, \ldots, N$.

² A recent result in this direction is discussed in Sect. XI-A.

A realization $u_1^N \in \mathcal{X}^N$ for the input random vector $U_1^N$
corresponds to sending the data vector $u_{\mathcal{A}}$ together with the
frozen vector $u_{\mathcal{A}^c}$. As random vectors, the data part $U_{\mathcal{A}}$
and the frozen part $U_{\mathcal{A}^c}$ are uniformly distributed over their
respective ranges and statistically independent. By treating
$U_{\mathcal{A}^c}$ as a random vector over $\mathcal{X}^{N-K}$, we obtain a convenient
method for analyzing code performance averaged over all
codes in the ensemble $(N, K, \mathcal{A})$.

The main event of interest in the following analysis is the
block error event under SC decoding, defined as

$\mathcal{E} \triangleq \{(u_1^N, y_1^N) \in \mathcal{X}^N \times \mathcal{Y}^N : \hat{U}_{\mathcal{A}}(u_1^N, y_1^N) \ne u_{\mathcal{A}}\}$. (49)

Since the decoder never makes an error on the frozen part of
$U_1^N$, i.e., $\hat{U}_{\mathcal{A}^c}$ equals $U_{\mathcal{A}^c}$ with probability one, that part has
been excluded from the definition of the block error event.

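The probability of error terms $P_e(N, K, \mathcal{A})$ and
$P_e(N, K, \mathcal{A}, u_{\mathcal{A}^c})$ that were defined in Sect. I-C.3 can
be expressed in this probability space as

$P_e(N, K, \mathcal{A}) = P(\mathcal{E})$,
$P_e(N, K, \mathcal{A}, u_{\mathcal{A}^c}) = P(\mathcal{E} \mid \{U_{\mathcal{A}^c} = u_{\mathcal{A}^c}\})$, (50)

where $\{U_{\mathcal{A}^c} = u_{\mathcal{A}^c}\}$ denotes the event $\{(\tilde{u}_1^N, y_1^N) \in \mathcal{X}^N \times
\mathcal{Y}^N : \tilde{u}_{\mathcal{A}^c} = u_{\mathcal{A}^c}\}$.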

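B. Proof of Proposition 2

We may express the block error event as $\mathcal{E} = \bigcup_{i \in \mathcal{A}} \mathcal{B}_i$ where

$\mathcal{B}_i \triangleq \{(u_1^N, y_1^N) \in \mathcal{X}^N \times \mathcal{Y}^N : u_1^{i-1} = \hat{U}_1^{i-1}(u_1^N, y_1^N),\ u_i \ne \hat{U}_i(u_1^N, y_1^N)\}$ (51)

is the event that the first decision error in SC decoding occurs
at stage $i$. We notice that

$\mathcal{B}_i = \{(u_1^N, y_1^N) \in \mathcal{X}^N \times \mathcal{Y}^N : u_1^{i-1} = \hat{U}_1^{i-1}(u_1^N, y_1^N),\ u_i \ne h_i(y_1^N, \hat{U}_1^{i-1}(u_1^N, y_1^N))\}$
$= \{(u_1^N, y_1^N) \in \mathcal{X}^N \times \mathcal{Y}^N : u_1^{i-1} = \hat{U}_1^{i-1}(u_1^N, y_1^N),\ u_i \ne h_i(y_1^N, u_1^{i-1})\}$
$\subset \{(u_1^N, y_1^N) \in \mathcal{X}^N \times \mathcal{Y}^N : u_i \ne h_i(y_1^N, u_1^{i-1})\} \subset \mathcal{E}_i$

where

$\mathcal{E}_i \triangleq \{(u_1^N, y_1^N) \in \mathcal{X}^N \times \mathcal{Y}^N : W_N^{(i)}(y_1^N, u_1^{i-1} \mid u_i) \le W_N^{(i)}(y_1^N, u_1^{i-1} \mid u_i \oplus 1)\}$. (52)

Thus, we have

$\mathcal{E} \subset \bigcup_{i \in \mathcal{A}} \mathcal{E}_i$, $P(\mathcal{E}) \le \sum_{i \in \mathcal{A}} P(\mathcal{E}_i)$.

For an upper bound on $P(\mathcal{E}_i)$, note that

$P(\mathcal{E}_i) = \sum_{u_1^N, y_1^N} \frac{1}{2^N} W_N(y_1^N \mid u_1^N)\, 1_{\mathcal{E}_i}(u_1^N, y_1^N)$
$\le \sum_{u_1^N, y_1^N} \frac{1}{2^N} W_N(y_1^N \mid u_1^N) \sqrt{\frac{W_N^{(i)}(y_1^N, u_1^{i-1} \mid u_i \oplus 1)}{W_N^{(i)}(y_1^N, u_1^{i-1} \mid u_i)}}$
$= Z(W_N^{(i)})$. (53)

We conclude that

$P(\mathcal{E}) \le \sum_{i \in \mathcal{A}} Z(W_N^{(i)})$,

which is equivalent to (13). This completes the proof of
Prop. 2. The main coding theorem of the paper now follows
readily.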

C. Proof of Theorem 3

By Theorem 2, for any given rate $R < I(W)$, there exists a
sequence of information sets $\mathcal{A}_N$ with size $|\mathcal{A}_N| \ge NR$ such
that

$\sum_{i \in \mathcal{A}_N} Z(W_N^{(i)}) \le N \max_{i \in \mathcal{A}_N} \{Z(W_N^{(i)})\} = O(N^{-1/4})$. (54)

In particular, the bound (54) holds if $\mathcal{A}_N$ is chosen in
accordance with the polar coding rule because by definition
this rule minimizes the sum in (54). Combining this fact about
the polar coding rule with Prop. 2, Theorem 3 follows.

D. A numerical example 

Although we have established that polar codes achieve the 
symmetric capacity, the proofs have been of an asymptotic 
nature and the exact asymptotic rate of polarization has not 
been found. It is of interest to understand how quickly the 
polarization effect takes hold and what performance can be 
expected of polar codes under SC decoding in the non- 
asymptotic regime. To investigate these, we give here a nu- 
merical study. 

Let $W$ be a BEC with erasure probability 1/2. Figure 7
shows the rate vs. reliability trade-off for $W$ using polar codes
with block-lengths $N \in \{2^{10}, 2^{15}, 2^{20}\}$. This figure is obtained
by using codes whose information sets are of the form $\mathcal{A}(\eta) \triangleq
\{i \in \{1, \ldots, N\} : Z(W_N^{(i)}) < \eta\}$, where $0 \le \eta \le 1$ is a
variable threshold parameter. There are two sets of three curves
in the plot. The solid lines are plots of $R(\eta) \triangleq |\mathcal{A}(\eta)|/N$ vs.
$B(\eta) \triangleq \sum_{i \in \mathcal{A}(\eta)} Z(W_N^{(i)})$. The dashed lines are plots of $R(\eta)$
vs. $L(\eta) \triangleq \max_{i \in \mathcal{A}(\eta)} \{Z(W_N^{(i)})\}$. The parameter $\eta$ is varied
over a subset of $[0, 1]$ to obtain the curves.
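The three quantities plotted in Fig. 7 can be reproduced for a BEC with a few lines of code. The following is a sketch under our own naming (`bec_z`, `rate_reliability`); it uses the BEC recursion (38) for $Z(W_N^{(i)})$, and the convention that $L(\eta) = 0$ for an empty information set is an assumption made here for convenience.

```python
def bec_z(eps, n):
    """Erasure probabilities Z(W_N^(i)), N = 2**n, for a BEC(eps)."""
    z = [eps]
    for _ in range(n):
        z = [v for zj in z for v in (2 * zj - zj * zj, zj * zj)]
    return z


def rate_reliability(eps, n, eta):
    """Return (R, B, L) for the information set A(eta) = {i : Z(W_N^(i)) < eta}."""
    z = bec_z(eps, n)
    chosen = [v for v in z if v < eta]
    rate = len(chosen) / len(z)              # R(eta) = |A(eta)| / N
    upper = sum(chosen)                      # B(eta), upper bound on P_e
    lower = max(chosen) if chosen else 0.0   # L(eta); 0.0 for an empty set (assumption)
    return rate, upper, lower


for eta in (1e-6, 1e-3, 1e-1):
    print(eta, rate_reliability(0.5, 10, eta))
```

Sweeping `eta` over a grid and plotting `rate` against `upper` and `lower` gives curves of the kind described above; larger `n` requires proportionally more memory since all $N$ values are kept.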

The parameter $R(\eta)$ corresponds to the code rate. The
significance of $B(\eta)$ is also clear: it is an upper bound on
$P_e(\eta)$, the probability of block error for polar coding at rate
$R(\eta)$ under SC decoding. The parameter $L(\eta)$ is intended to
serve as a lower bound to $P_e(\eta)$.

Fig. 7. Rate vs. reliability for polar coding and SC decoding at block-lengths
$2^{10}$, $2^{15}$, and $2^{20}$ on a BEC with erasure probability 1/2.

This example provides empirical evidence that polar coding 
achieves channel capacity as the block-length is increased — a 
fact already established theoretically. More significantly, the 
example also shows that the rate of polarization is too slow to 
make near-capacity polar coding under SC decoding feasible 
in practice. 

VI. Symmetric channels 

The main goal of this section is to prove Theorem 4,
which is a strengthened version of Theorem 3 for symmetric
channels.

A. Symmetry under channel combining and splitting 

Let $W : \mathcal{X} \to \mathcal{Y}$ be a symmetric B-DMC with $\mathcal{X} = \{0, 1\}$
and $\mathcal{Y}$ arbitrary. By definition, there exists a permutation $\pi_1$
on $\mathcal{Y}$ such that (i) $\pi_1^{-1} = \pi_1$ and (ii) $W(y|1) = W(\pi_1(y)|0)$
for all $y \in \mathcal{Y}$. Let $\pi_0$ be the identity permutation on $\mathcal{Y}$.
Clearly, the permutations $(\pi_0, \pi_1)$ form an abelian group under
function composition. For a compact notation, we will write
$x \cdot y$ to denote $\pi_x(y)$, for $x \in \mathcal{X}$, $y \in \mathcal{Y}$.

Observe that $W(y|x \oplus a) = W(a \cdot y|x)$ for all $a, x \in \mathcal{X}$,
$y \in \mathcal{Y}$. This can be verified by exhaustive study of possible
cases or by noting that $W(y|x \oplus a) = W((x \oplus a) \cdot y|0) =
W(x \cdot (a \cdot y)|0) = W(a \cdot y|x)$. Also observe that $W(y|x \oplus a) =
W(x \cdot y|a)$ as $\oplus$ is a commutative operation on $\mathcal{X}$.
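The "exhaustive study of possible cases" mentioned above is small enough to automate. The sketch below is illustrative and uses our own names; it takes the BSC, for which $\pi_1$ swaps the two output letters so that $x \cdot y = x \oplus y$, and checks both identities over all $(a, x, y)$. A Z-channel is included only as an example of a channel that fails the check under this particular action.

```python
def bsc(p):
    """Transition probabilities W(y|x) of a BSC(p), stored as W[x][y]."""
    return {0: {0: 1 - p, 1: p}, 1: {0: p, 1: 1 - p}}


def act(x, y):
    """x . y = pi_x(y); for the BSC, pi_1 swaps the two outputs."""
    return y ^ x


def check_symmetry(W):
    """True iff W(y|x+a) = W(a.y|x) = W(x.y|a) for all a, x, y in {0, 1}."""
    return all(
        W[x ^ a][y] == W[x][act(a, y)] == W[a][act(x, y)]
        for a in (0, 1) for x in (0, 1) for y in (0, 1)
    )


z_channel = {0: {0: 1.0, 1: 0.0}, 1: {0: 0.3, 1: 0.7}}
print(check_symmetry(bsc(0.1)))    # True
print(check_symmetry(z_channel))   # False
```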

For $x_1^N \in \mathcal{X}^N$, $y_1^N \in \mathcal{Y}^N$, let

$x_1^N \cdot y_1^N \triangleq (x_1 \cdot y_1, \ldots, x_N \cdot y_N)$. (55)

This associates to each element of $\mathcal{X}^N$ a permutation on $\mathcal{Y}^N$.
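Proposition 12: If a B-DMC $W$ is symmetric, then $W^N$ is
also symmetric in the sense that

$W^N(y_1^N \mid x_1^N \oplus a_1^N) = W^N(x_1^N \cdot y_1^N \mid a_1^N)$ (56)

for all $x_1^N, a_1^N \in \mathcal{X}^N$, $y_1^N \in \mathcal{Y}^N$.
The proof is immediate and omitted.

Proposition 13: If a B-DMC $W$ is symmetric, then the
channels $W_N$ and $W_N^{(i)}$ are also symmetric in the sense that

$W_N(y_1^N \mid u_1^N) = W_N(a_1^N G_N \cdot y_1^N \mid u_1^N \oplus a_1^N)$, (57)

$W_N^{(i)}(y_1^N, u_1^{i-1} \mid u_i) = W_N^{(i)}(a_1^N G_N \cdot y_1^N, u_1^{i-1} \oplus a_1^{i-1} \mid u_i \oplus a_i)$ (58)

for all $u_1^N, a_1^N \in \mathcal{X}^N$, $y_1^N \in \mathcal{Y}^N$, $N = 2^n$, $n \ge 0$, $1 \le i \le N$.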



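Proof: Let $x_1^N = u_1^N G_N$ and observe that $W_N(y_1^N \mid u_1^N) =
\prod_{i=1}^{N} W(y_i \mid x_i) = \prod_{i=1}^{N} W(x_i \cdot y_i \mid 0) = W_N(x_1^N \cdot y_1^N \mid 0_1^N)$.
Now, let $b_1^N = a_1^N G_N$, and use the same reasoning
to see that $W_N(b_1^N \cdot y_1^N \mid u_1^N \oplus a_1^N) = W_N((x_1^N \oplus b_1^N) \cdot (b_1^N \cdot
y_1^N) \mid 0_1^N) = W_N(x_1^N \cdot y_1^N \mid 0_1^N)$. This proves the first claim.
To prove the second claim, we use the first result.

$W_N^{(i)}(y_1^N, u_1^{i-1} \mid u_i) = \sum_{u_{i+1}^N} \frac{1}{2^{N-1}} W_N(y_1^N \mid u_1^N)$
$= \sum_{u_{i+1}^N} \frac{1}{2^{N-1}} W_N(a_1^N G_N \cdot y_1^N \mid u_1^N \oplus a_1^N)$
$= W_N^{(i)}(a_1^N G_N \cdot y_1^N, u_1^{i-1} \oplus a_1^{i-1} \mid u_i \oplus a_i)$

where we used the fact that the sum over $u_{i+1}^N \in \mathcal{X}^{N-i}$ can
be replaced with a sum over $u_{i+1}^N \oplus a_{i+1}^N$ for any fixed $a_1^N$
since $\{u_{i+1}^N \oplus a_{i+1}^N : u_{i+1}^N \in \mathcal{X}^{N-i}\} = \mathcal{X}^{N-i}$. ■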

B. Proof of Theorem 4

We return to the analysis in Sect. V and consider a code
ensemble $(N, K, \mathcal{A})$ under SC decoding, only this time
assuming that $W$ is a symmetric channel. We first show that the
error events $\{\mathcal{E}_i\}$ defined by (52) have a symmetry property.

Proposition 14: For a symmetric B-DMC $W$, the event $\mathcal{E}_i$
has the property that

$(u_1^N, y_1^N) \in \mathcal{E}_i$ iff $(a_1^N \oplus u_1^N, a_1^N G_N \cdot y_1^N) \in \mathcal{E}_i$ (59)

for each $1 \le i \le N$, $(u_1^N, y_1^N) \in \mathcal{X}^N \times \mathcal{Y}^N$, $a_1^N \in \mathcal{X}^N$.

Proof: This follows directly from the definition of $\mathcal{E}_i$ by
using the symmetry property (58) of the channel $W_N^{(i)}$. ■

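Now, consider the transmission of a particular source vector
$u_{\mathcal{A}}$ and a frozen vector $u_{\mathcal{A}^c}$, jointly forming an input vector
$u_1^N$ for the channel $W_N$. This event is denoted below as
$\{U_1^N = u_1^N\}$ instead of the more formal $\{u_1^N\} \times \mathcal{Y}^N$.

Corollary 1: For a symmetric B-DMC $W$, for each $1 \le
i \le N$ and $u_1^N \in \mathcal{X}^N$, the events $\mathcal{E}_i$ and $\{U_1^N = u_1^N\}$ are
independent; hence, $P(\mathcal{E}_i) = P(\mathcal{E}_i \mid \{U_1^N = u_1^N\})$.

Proof: For $(u_1^N, y_1^N) \in \mathcal{X}^N \times \mathcal{Y}^N$ and $x_1^N = u_1^N G_N$,
we have

$P(\mathcal{E}_i \mid \{U_1^N = u_1^N\}) = \sum_{y_1^N} W_N(y_1^N \mid u_1^N)\, 1_{\mathcal{E}_i}(u_1^N, y_1^N)$
$= \sum_{y_1^N} W_N(x_1^N \cdot y_1^N \mid 0_1^N)\, 1_{\mathcal{E}_i}(0_1^N, x_1^N \cdot y_1^N)$ (60)
$= P(\mathcal{E}_i \mid \{U_1^N = 0_1^N\})$. (61)

Equality follows in (60) from (57) and (59) by taking $a_1^N =
u_1^N$, and in (61) from the fact that $\{x_1^N \cdot y_1^N : y_1^N \in \mathcal{Y}^N\} = \mathcal{Y}^N$
for any fixed $x_1^N \in \mathcal{X}^N$. The rest of the proof is immediate. ■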

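Now, by (53), we have, for all $u_1^N \in \mathcal{X}^N$,

$P(\mathcal{E}_i \mid \{U_1^N = u_1^N\}) \le Z(W_N^{(i)})$ (62)

and, since $\mathcal{E} \subset \bigcup_{i \in \mathcal{A}} \mathcal{E}_i$, we obtain

$P(\mathcal{E} \mid \{U_1^N = u_1^N\}) \le \sum_{i \in \mathcal{A}} Z(W_N^{(i)})$. (63)

This implies that, for every symmetric B-DMC $W$ and every
$(N, K, \mathcal{A}, u_{\mathcal{A}^c})$ code,

$P_e(N, K, \mathcal{A}, u_{\mathcal{A}^c}) = \sum_{u_{\mathcal{A}} \in \mathcal{X}^K} \frac{1}{2^K} P(\mathcal{E} \mid \{U_1^N = u_1^N\}) \le \sum_{i \in \mathcal{A}} Z(W_N^{(i)})$. (64)

This bound on $P_e(N, K, \mathcal{A}, u_{\mathcal{A}^c})$ is independent of the frozen
vector $u_{\mathcal{A}^c}$. Theorem 4 is now obtained by combining
Theorem 2 with Prop. 2, as in the proof of Theorem 3.

Note that although we have given a bound on $P(\mathcal{E} \mid \{U_1^N =
u_1^N\})$ that is independent of $u_1^N$, we stopped short of claiming
that the error event $\mathcal{E}$ is independent of $U_1^N$ because our
decision functions $\{h_i\}$ break ties always in favor of $\hat{u}_i = 0$.
If this bias were removed by randomization, then $\mathcal{E}$ would
become independent of $U_1^N$.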



C. Further symmetries of the channel $W_N^{(i)}$

We may use the degrees of freedom in the choice of $a_1^N$ in
(58) to explore the symmetries inherent in the channel $W_N^{(i)}$.
For a given $(y_1^N, u_1^i)$, we may select $a_1^N$ with $a_1^i = u_1^i$ to
obtain

$W_N^{(i)}(y_1^N, u_1^{i-1} \mid u_i) = W_N^{(i)}(a_1^N G_N \cdot y_1^N, 0_1^{i-1} \mid 0)$. (65)

So, if we were to prepare a look-up table for the transition
probabilities $\{W_N^{(i)}(y_1^N, u_1^{i-1} \mid u_i) : y_1^N \in \mathcal{Y}^N, u_1^i \in \mathcal{X}^i\}$,
it would suffice to store only the subset of probabilities
$\{W_N^{(i)}(y_1^N, 0_1^{i-1} \mid 0) : y_1^N \in \mathcal{Y}^N\}$.

The size of the look-up table can be reduced further by
using the remaining degrees of freedom in the choice of $a_{i+1}^N$.

Let $\mathcal{X}_{i+1}^N \triangleq \{a_1^N \in \mathcal{X}^N : a_1^i = 0_1^i\}$, $1 \le i \le N$. Then, for
any $1 \le i \le N$, $a_1^N \in \mathcal{X}_{i+1}^N$, and $y_1^N \in \mathcal{Y}^N$, we have

$W_N^{(i)}(y_1^N, 0_1^{i-1} \mid 0) = W_N^{(i)}(a_1^N G_N \cdot y_1^N, 0_1^{i-1} \mid 0)$ (66)

which follows from (65) by taking $u_1^i = 0_1^i$ on the left hand
side.

To explore this symmetry further, let $\mathcal{X}_{i+1}^N \cdot y_1^N \triangleq \{a_1^N G_N \cdot
y_1^N : a_1^N \in \mathcal{X}_{i+1}^N\}$. The set $\mathcal{X}_{i+1}^N \cdot y_1^N$ is the orbit of $y_1^N$ under
the action group $\mathcal{X}_{i+1}^N$. The orbits $\mathcal{X}_{i+1}^N \cdot y_1^N$ over variation
of $y_1^N$ partition the space $\mathcal{Y}^N$ into equivalence classes. Let
$\mathcal{Y}_{i+1}^N$ be a set formed by taking one representative from each
equivalence class. The output alphabet of the channel $W_N^{(i)}$
can be represented effectively by the set $\mathcal{Y}_{i+1}^N$.

For example, suppose $W$ is a BSC with $\mathcal{Y} = \{0, 1\}$. Each
orbit $\mathcal{X}_{i+1}^N \cdot y_1^N$ has $2^{N-i}$ elements and there are $2^i$ orbits.
In particular, the channel $W_N^{(1)}$ has effectively two outputs,
and being symmetric, it has to be a BSC. This is a great
simplification since $W_N^{(1)}$ has an apparent output alphabet size
of $2^N$. Likewise, while $W_N^{(i)}$ has an apparent output alphabet
size of $2^{N+i-1}$, due to symmetry, the size shrinks to $2^i$.

Further output alphabet size reductions may be possible by
exploiting other properties specific to certain B-DMCs. For
example, if $W$ is a BEC, the channels $\{W_N^{(i)}\}$ are known to
be BECs, each with an effective output alphabet size of three.

The symmetry properties of $\{W_N^{(i)}\}$ help simplify the
computation of the channel parameters.

Proposition 15: For any symmetric B-DMC $W$, the
parameters $\{Z(W_N^{(i)})\}$ given by (7) can be calculated by the
simplified formula

$Z(W_N^{(i)}) = 2^{i-1} \sum_{y_1^N \in \mathcal{Y}_{i+1}^N} |\mathcal{X}_{i+1}^N \cdot y_1^N| \sqrt{W_N^{(i)}(y_1^N, 0_1^{i-1} \mid 0)\, W_N^{(i)}(y_1^N, 0_1^{i-1} \mid 1)}$.

We omit the proof of this result.

For the important example of a BSC, this formula becomes

$Z(W_N^{(i)}) = 2^{N-1} \sum_{y_1^N \in \mathcal{Y}_{i+1}^N} \sqrt{W_N^{(i)}(y_1^N, 0_1^{i-1} \mid 0)\, W_N^{(i)}(y_1^N, 0_1^{i-1} \mid 1)}$.

This sum for $Z(W_N^{(i)})$ has $2^i$ terms, as compared to $2^{N+i-1}$
terms in (7).

Fig. 8. An alternative realization of the recursive construction for $W_N$.



VII. Encoding 

In this section, we will consider the encoding of polar codes
and prove the part of Theorem 5 about encoding complexity.
We begin by giving explicit algebraic expressions for $G_N$,
the generator matrix for polar coding, which so far has been
defined only in a schematic form by Fig. 3. The algebraic
forms of $G_N$ naturally point at efficient implementations of the
encoding operation $x_1^N = u_1^N G_N$. In analyzing the encoding
operation $G_N$, we exploit its relation to fast transform methods
in signal processing; in particular, we use the bit-indexing idea
of [11] to interpret the various permutation operations that are
part of $G_N$.

A. Formulas for $G_N$

In the following, assume $N = 2^n$ for some $n \ge 0$. Let $I_k$
denote the $k$-dimensional identity matrix for any $k \ge 1$. We
begin by translating the recursive definition of $G_N$ as given
by Fig. 3 into an algebraic form:

$G_N = (I_{N/2} \otimes F)\, R_N\, (I_2 \otimes G_{N/2})$, for $N \ge 2$,

with $G_1 = I_1$.

Either by verifying algebraically that $(I_{N/2} \otimes F) R_N =
R_N (F \otimes I_{N/2})$ or by observing that the channel combining
operation in Fig. 3 can be redrawn equivalently as in Fig. 8, we
obtain a second recursive formula

$G_N = R_N (F \otimes I_{N/2})(I_2 \otimes G_{N/2}) = R_N (F \otimes G_{N/2})$, (67)

valid for $N \ge 2$. This form appears more suitable to derive a
recursive relationship. We substitute $G_{N/2} = R_{N/2}(F \otimes G_{N/4})$
back into (67) to obtain

$G_N = R_N \left(F \otimes (R_{N/2}(F \otimes G_{N/4}))\right) = R_N (I_2 \otimes R_{N/2})(F^{\otimes 2} \otimes G_{N/4})$ (68)

where (68) is obtained by using the identity $(AC) \otimes (BD) =
(A \otimes B)(C \otimes D)$ with $A = I_2$, $B = R_{N/2}$, $C = F$, $D =
F \otimes G_{N/4}$. Repeating this, we obtain

$G_N = B_N F^{\otimes n}$ (69)

where $B_N \triangleq R_N (I_2 \otimes R_{N/2})(I_4 \otimes R_{N/4}) \cdots (I_{N/2} \otimes R_2)$. It
can be seen by simple manipulations that

$B_N = R_N (I_2 \otimes B_{N/2})$. (70)

We can see that $B_N$ is a permutation matrix by the following
induction argument. Assume that $B_{N/2}$ is a permutation matrix
for some $N \ge 4$; this is true for $N = 4$ since $B_2 = I_2$. Then,
$B_N$ is a permutation matrix because it is the product of two
permutation matrices, $R_N$ and $I_2 \otimes B_{N/2}$.

In the following, we will say more about the nature of $B_N$
as a permutation.

B. Analysis by bit-indexing 

To analyze the encoding operation further, it will be
convenient to index vectors and matrices with bit sequences. Given
a vector $a_1^N$ with length $N = 2^n$ for some $n \ge 0$, we denote
its $i$th element, $a_i$, $1 \le i \le N$, alternatively as $a_{b_1 \cdots b_n}$ where
$b_1 \cdots b_n$ is the binary expansion of the integer $i - 1$ in the sense
that $i = 1 + \sum_{j=1}^{n} b_j 2^{n-j}$. Likewise, the element $A_{ij}$ of an
$N$-by-$N$ matrix $A$ is denoted alternatively as $A_{b_1 \cdots b_n, b'_1 \cdots b'_n}$
where $b_1 \cdots b_n$ and $b'_1 \cdots b'_n$ are the binary representations
of $i - 1$ and $j - 1$, respectively. Using this convention, it
can be readily verified that the product $C = A \otimes B$ of a
$2^n$-by-$2^n$ matrix $A$ and a $2^m$-by-$2^m$ matrix $B$ has elements
$C_{b_1 \cdots b_{n+m}, b'_1 \cdots b'_{n+m}} = A_{b_1 \cdots b_n, b'_1 \cdots b'_n} B_{b_{n+1} \cdots b_{n+m}, b'_{n+1} \cdots b'_{n+m}}$.

We now consider the encoding operation under bit-indexing.
First, we observe that the elements of $F$ in bit-indexed form
are given by $F_{b,b'} = 1 \oplus b' \oplus bb'$ for all $b, b' \in \{0, 1\}$. Thus,
$F^{\otimes n}$ has elements

$F^{\otimes n}_{b_1 \cdots b_n, b'_1 \cdots b'_n} = \prod_{i=1}^{n} F_{b_i, b'_i} = \prod_{i=1}^{n} (1 \oplus b'_i \oplus b_i b'_i)$. (71)

Second, the reverse shuffle operator $R_N$ acts on a row vector
$u_1^N$ to replace the element in bit-indexed position $b_1 \cdots b_n$ with
the element in position $b_2 \cdots b_n b_1$; that is, if $v_1^N = u_1^N R_N$,
then $v_{b_1 \cdots b_n} = u_{b_2 \cdots b_n b_1}$ for all $b_1, \ldots, b_n \in \{0, 1\}$. In other
words, $R_N$ cyclically rotates the bit-indexes of the elements
of a left operand $u_1^N$ to the right by one place.

Third, the matrix $B_N$ in (69) can be interpreted as the
bit-reversal operator: if $v_1^N = u_1^N B_N$, then $v_{b_1 \cdots b_n} = u_{b_n \cdots b_1}$
for all $b_1, \ldots, b_n \in \{0, 1\}$. This statement can be proved by
induction using the recursive formula (70). We give the idea
of such a proof by an example. Let us assume that $B_4$ is a
bit-reversal operator and show that the same is true for $B_8$.
Let $u_1^8$ be any vector over GF(2). Using bit-indexing, it can
be written as $(u_{000}, u_{001}, u_{010}, u_{011}, u_{100}, u_{101}, u_{110}, u_{111})$.
Since $u_1^8 B_8 = u_1^8 R_8 (I_2 \otimes B_4)$, let us first consider the
action of $R_8$ on $u_1^8$. The reverse shuffle $R_8$ rearranges the
elements of $u_1^8$ with respect to odd-even parity of their indices,
so $u_1^8 R_8$ equals $(u_{000}, u_{010}, u_{100}, u_{110}, u_{001}, u_{011}, u_{101}, u_{111})$.
This has two halves, $c_1^4 \triangleq (u_{000}, u_{010}, u_{100}, u_{110})$ and
$d_1^4 \triangleq (u_{001}, u_{011}, u_{101}, u_{111})$, corresponding to odd-even
index classes. Notice that $c_{b_1 b_2} = u_{b_1 b_2 0}$ and $d_{b_1 b_2} = u_{b_1 b_2 1}$ for
all $b_1, b_2 \in \{0, 1\}$. This is to be expected since the reverse
shuffle rearranges the indices in increasing order within each
odd-even index class. Next, consider the action of $I_2 \otimes B_4$ on
$(c_1^4, d_1^4)$. The result is $(c_1^4 B_4, d_1^4 B_4)$. By assumption, $B_4$ is a
bit-reversal operation, so $c_1^4 B_4 = (c_{00}, c_{10}, c_{01}, c_{11})$, which
in turn equals $(u_{000}, u_{100}, u_{010}, u_{110})$. Likewise, the result
of $d_1^4 B_4$ equals $(u_{001}, u_{101}, u_{011}, u_{111})$. Hence, the overall
operation $B_8$ is a bit-reversal operation.

Given the bit-reversal interpretation of $B_N$, it is clear that
$B_N$ is a symmetric matrix, so $B_N^T = B_N$. Since $B_N$ is a
permutation, it follows from symmetry that $B_N^{-1} = B_N$.
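The bit-reversal interpretation is straightforward to express in code. The sketch below uses 0-based positions $k = i - 1$ and our own function names; it reproduces the $N = 8$ example above and can be used to confirm that applying the permutation twice returns the original vector.

```python
def bit_reverse_index(k, n):
    """Reverse the n-bit binary expansion of k (0 <= k < 2**n)."""
    r = 0
    for _ in range(n):
        r = (r << 1) | (k & 1)
        k >>= 1
    return r


def apply_bit_reversal(u):
    """Return v = u B_N, i.e. v[b_1...b_n] = u[b_n...b_1], for len(u) = 2**n."""
    n = len(u).bit_length() - 1
    if len(u) != 1 << n:
        raise ValueError("length must be a power of two")
    return [u[bit_reverse_index(k, n)] for k in range(len(u))]


labels = ["u000", "u001", "u010", "u011", "u100", "u101", "u110", "u111"]
print(apply_bit_reversal(labels))
# ['u000', 'u100', 'u010', 'u110', 'u001', 'u101', 'u011', 'u111']
```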

It is now easy to see that, for any $N$-by-$N$ matrix $A$,
the product $C = B_N^T A B_N$ has elements $C_{b_1 \cdots b_n, b'_1 \cdots b'_n} =
A_{b_n \cdots b_1, b'_n \cdots b'_1}$. It follows that if $A$ is invariant under
bit-reversal, i.e., if $A_{b_1 \cdots b_n, b'_1 \cdots b'_n} = A_{b_n \cdots b_1, b'_n \cdots b'_1}$ for every
$b_1, \ldots, b_n, b'_1, \ldots, b'_n \in \{0, 1\}$, then $A = B_N^T A B_N$. Since
$B_N^T = B_N^{-1}$, this is equivalent to $B_N A = A B_N^T$. Thus,
bit-reversal-invariant matrices commute with the bit-reversal
operator.



Proposition 16: For any $N = 2^n$, $n \ge 1$, the generator
matrix $G_N$ is given by $G_N = B_N F^{\otimes n}$ and $G_N = F^{\otimes n} B_N$
where $B_N$ is the bit-reversal permutation. $G_N$ is a bit-reversal
invariant matrix with

$(G_N)_{b_1 \cdots b_n, b'_1 \cdots b'_n} = \prod_{i=1}^{n} (1 \oplus b'_i \oplus b_{n-i+1} b'_i)$. (72)

Proof: $F^{\otimes n}$ commutes with $B_N$ because it is invariant
under bit-reversal, which is immediate from (71). The
statement $G_N = B_N F^{\otimes n}$ was established before; by proving
that $F^{\otimes n}$ commutes with $B_N$, we have established the other
statement: $G_N = F^{\otimes n} B_N$. The bit-indexed form (72) follows
by applying bit-reversal to (71). ■
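Both forms of $G_N$ in Prop. 16 can be compared against the recursive definition (67) on small block-lengths. The sketch below is ours and works directly on row vectors over GF(2) rather than on matrices: `encode_recursive` follows $G_N = R_N(F \otimes G_{N/2})$, `kron_power` applies $F^{\otimes n}$, and `bit_reversal` applies $B_N$. All three names are hypothetical helpers introduced for this check.

```python
from itertools import product


def encode_recursive(u):
    """x = u G_N from G_N = R_N (F (x) G_{N/2}), with G_1 = I_1."""
    if len(u) == 1:
        return list(u)
    odd, even = u[0::2], u[1::2]      # u R_N = (u_1, u_3, ..., u_2, u_4, ...)
    mixed = [a ^ b for a, b in zip(odd, even)]
    return encode_recursive(mixed) + encode_recursive(even)


def kron_power(v):
    """x = v F^{(x) n}, using F^{(x) n} = F (x) F^{(x)(n-1)}."""
    if len(v) == 1:
        return list(v)
    h = len(v) // 2
    top, bottom = v[:h], v[h:]
    return kron_power([a ^ b for a, b in zip(top, bottom)]) + kron_power(bottom)


def bit_reversal(v):
    """v B_N: position k receives the element at the bit-reversed position."""
    n = len(v).bit_length() - 1
    return [v[int(format(k, f"0{n}b")[::-1], 2)] for k in range(len(v))]


ok = all(
    encode_recursive(u) == kron_power(bit_reversal(u)) == bit_reversal(kron_power(u))
    for u in product((0, 1), repeat=8)
)
print(ok)   # True: G_8 = B_8 F^{(x)3} = F^{(x)3} B_8 on all 256 inputs
```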

Finally, we give a fact that will be useful in Sect. X.

Proposition 17: For any $N = 2^n$, $n \ge 0$, $b_1, \ldots, b_n \in
\{0, 1\}$, the rows of $G_N$ and $F^{\otimes n}$ with index $b_1 \cdots b_n$ have
the same Hamming weight given by $2^{w_H(b_1, \ldots, b_n)}$ where

$w_H(b_1, \ldots, b_n) \triangleq \sum_{i=1}^{n} b_i$ (73)

is the Hamming weight of $(b_1, \ldots, b_n)$.

Proof: For fixed $b_1, \ldots, b_n$, the sum of the terms
$(G_N)_{b_1 \cdots b_n, b'_1 \cdots b'_n}$ (as integers) over all $b'_1, \ldots, b'_n \in \{0, 1\}$
gives the Hamming weight of the row of $G_N$ with index
$b_1 \cdots b_n$. From the preceding formula for $(G_N)_{b_1 \cdots b_n, b'_1 \cdots b'_n}$,
this sum is easily seen to be $2^{w_H(b_1, \ldots, b_n)}$. The proof for $F^{\otimes n}$
is similar. ■
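The row-weight formula can be confirmed by brute force from the bit-indexed entries. The sketch below evaluates the products $\prod_i (1 \oplus b'_i \oplus b_i b'_i)$ for $F^{\otimes n}$ and $\prod_i (1 \oplus b'_i \oplus b_{n-i+1} b'_i)$ for $G_N$; the function names are our own, and the exhaustive loop is only practical for small $n$.

```python
from itertools import product


def f_kron_entry(b, bp):
    """Entry of F^{(x) n} at row b_1...b_n, column b'_1...b'_n."""
    out = 1
    for bi, bpi in zip(b, bp):
        out *= 1 ^ bpi ^ (bi & bpi)
    return out


def g_entry(b, bp):
    """Entry of G_N: the same product with the row bits taken in reverse order."""
    return f_kron_entry(b[::-1], bp)


def row_weight(entry, b):
    """Hamming weight of the row with index b, summing entries as integers."""
    return sum(entry(b, bp) for bp in product((0, 1), repeat=len(b)))


for b in product((0, 1), repeat=3):
    print(b, row_weight(f_kron_entry, b), row_weight(g_entry, b), 2 ** sum(b))
```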

C. Encoding complexity 

For complexity estimation, our computational model will 
be a single processor machine with a random access memory. 
The complexities expressed will be time complexities. The 
discussion will be given for an arbitrary $G_N$-coset code with
parameters $(N, K, \mathcal{A}, u_{\mathcal{A}^c})$.

Let $\chi_E(N)$ denote the worst-case encoding complexity over
all $(N, K, \mathcal{A}, u_{\mathcal{A}^c})$ codes with a given block-length $N$. If we
take the complexity of a scalar mod-2 addition as 1 unit and the
complexity of the reverse shuffle operation $R_N$ as $N$ units, we
see from Fig. 3 that $\chi_E(N) \le N/2 + N + 2\chi_E(N/2)$. Starting
with an initial value $\chi_E(2) = 3$ (a generous figure), we obtain
by induction that $\chi_E(N) \le \frac{3}{2} N \log N$ for all $N = 2^n$, $n \ge 1$.
Thus, the encoding complexity is $O(N \log N)$.
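The induction can be checked mechanically by unrolling the recursion with equality, which gives the worst case allowed by the bound. In the sketch below (the name `chi_e` is ours), the recursion $\chi_E(N) = N/2 + N + 2\chi_E(N/2)$ with $\chi_E(2) = 3$ is compared with $\frac{3}{2} N \log_2 N$; the two coincide exactly.

```python
def chi_e(N):
    """Worst case of chi_E(N) = N/2 + N + 2 chi_E(N/2), with chi_E(2) = 3."""
    if N == 2:
        return 3
    return N // 2 + N + 2 * chi_e(N // 2)


for n in range(1, 11):
    N = 2 ** n
    print(N, chi_e(N), 3 * N * n // 2)   # the last two columns agree
```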

A specific implementation of the encoder using the form
$G_N = B_N F^{\otimes n}$ is shown in Fig. 9 for $N = 8$. The input to
the circuit is the bit-reversed version of $u_1^8$, i.e., $\tilde{u}_1^8 = u_1^8 B_8$.
The output is given by $x_1^8 = \tilde{u}_1^8 F^{\otimes 3} = u_1^8 G_8$. In general, the
complexity of this implementation is $O(N \log N)$ with $O(N)$
for $B_N$ and $O(N \log N)$ for $F^{\otimes n}$.

An alternative implementation of the encoder would be to
apply $u_1^8$ in natural index order at the input of the circuit in
Fig. 9. Then, we would obtain $\tilde{x}_1^8 = u_1^8 F^{\otimes 3}$ at the output.
Encoding could be completed by a post bit-reversal operation:
$x_1^8 = \tilde{x}_1^8 B_8 = u_1^8 G_8$.
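Both encoder forms can be sketched in a few lines. The code below is a minimal illustration, not the circuit of Fig. 9 itself: the function names are hypothetical, vectors are 0-indexed Python lists of bits, and the butterfly loop is one assumed way of applying $F^{\otimes n}$ in $O(N \log N)$ mod-2 additions.

```python
def bit_reverse(i, n):
    """Reverse the n-bit binary representation of i."""
    r = 0
    for _ in range(n):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def bit_reverse_permute(v):
    """Return v B_N, i.e. the vector whose i-th entry is v[bit_reverse(i)]."""
    n = len(v).bit_length() - 1
    return [v[bit_reverse(i, n)] for i in range(len(v))]


def polar_transform(u):
    """Return u F^(n) over GF(2) using log2(N) butterfly stages."""
    x = list(u)
    step = 1
    while step < len(x):
        for start in range(0, len(x), 2 * step):
            for j in range(start, start + step):
                x[j] ^= x[j + step]      # (a, b) -> (a xor b, b)
        step *= 2
    return x


def encode(u):
    """x = u G_N with G_N = B_N F^(n): bit-reverse first, then transform."""
    return polar_transform(bit_reverse_permute(u))


def encode_alt(u):
    """x = u G_N with G_N = F^(n) B_N: transform first, then bit-reverse."""
    return bit_reverse_permute(polar_transform(u))


if __name__ == "__main__":
    u = [0, 1, 1, 0, 1, 0, 0, 1]
    print(encode(u), encode_alt(u))
```

The two functions agree on every input because $F^{\otimes n}$ commutes with $B_N$.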

The encoding circuit of Fig. 9 suggests many parallel
implementation alternatives for $F^{\otimes n}$: for example, with $N$
processors, one may do a "column by column" implementation,
and reduce the total latency to $\log N$. Various other trade-offs
are possible between latency and hardware complexity.

[Fig. 9 input labels: $\tilde{u}_1 = u_1$, $\tilde{u}_2 = u_5$, $\tilde{u}_3 = u_3$, $\tilde{u}_4 = u_7$, $\tilde{u}_5 = u_2$, $\tilde{u}_6 = u_6$, $\tilde{u}_7 = u_4$, $\tilde{u}_8 = u_8$.]

Fig. 9. A circuit for implementing the transformation $F^{\otimes 3}$. Signals flow
from left to right. Each edge carries a signal 0 or 1. Each node adds (mod-2)
the signals on all incoming edges from the left and sends the result out on
all edges to the right. (Edges carrying the signals $u_i$ and $x_i$ are not shown.)



In an actual implementation of polar codes, it may be
preferable to use $F^{\otimes n}$ in place of $B_N F^{\otimes n}$ as the encoder
mapping in order to simplify the implementation. In that case,
the SC decoder should compensate for this by decoding the
elements of the source vector $u_1^N$ in bit-reversed index order.
We have included $B_N$ as part of the encoder in this paper in
order to have a SC decoder that decodes $u_1^N$ in the natural
index order, which simplified the notation.

VIII. Decoding 

In this section, we consider the computational complexity
of the SC decoding algorithm. As in the previous section,
our computational model will be a single processor machine
with a random access memory and the complexities expressed
will be time complexities. Let $\chi_D(N)$ denote the worst-case
complexity of SC decoding over all $G_N$-coset codes
with a given block-length $N$. We will show that
$\chi_D(N) = O(N \log N)$.

A. A first decoding algorithm 

Consider SC decoding for an arbitrary $G_N$-coset code with
parameter $(N, K, \mathcal{A}, u_{\mathcal{A}^c})$. Recall that the source vector $u_1^N$
consists of a random part $u_{\mathcal{A}}$ and a frozen part $u_{\mathcal{A}^c}$. This
vector is transmitted across $W_N$ and a channel output $y_1^N$
is obtained with probability $W_N(y_1^N | u_1^N)$. The SC decoder
observes $(y_1^N, u_{\mathcal{A}^c})$ and generates an estimate $\hat{u}_1^N$ of $u_1^N$.
We may visualize the decoder as consisting of $N$ decision
elements (DEs), one for each source element $u_i$; the DEs are
activated in the order 1 to $N$. If $i \in \mathcal{A}^c$, the element $u_i$
is known; so, the $i$th DE, when its turn comes, simply sets
$\hat{u}_i = u_i$ and sends this result to all succeeding DEs. If $i \in \mathcal{A}$,
the $i$th DE waits until it has received the previous decisions
$\hat{u}_1^{i-1}$, and upon receiving them, computes the likelihood ratio
(LR)

$$L_N^{(i)}(y_1^N, \hat{u}_1^{i-1}) \triangleq \frac{W_N^{(i)}(y_1^N, \hat{u}_1^{i-1} \mid 0)}{W_N^{(i)}(y_1^N, \hat{u}_1^{i-1} \mid 1)}$$

and generates its decision as

$$\hat{u}_i = \begin{cases} 0, & \text{if } L_N^{(i)}(y_1^N, \hat{u}_1^{i-1}) \ge 1 \\ 1, & \text{otherwise} \end{cases}$$

which is then sent to all succeeding DEs. This is a single-pass
algorithm, with no revision of estimates. The complexity of
this algorithm is determined essentially by the complexity of
computing the LRs.

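A straightforward calculation using the recursive formulas
(22) and (23) gives

$$L_N^{(2i-1)}(y_1^N, \hat{u}_1^{2i-2}) = \frac{L_{N/2}^{(i)}(y_1^{N/2}, \hat{u}_{1,o}^{2i-2} \oplus \hat{u}_{1,e}^{2i-2})\, L_{N/2}^{(i)}(y_{N/2+1}^{N}, \hat{u}_{1,e}^{2i-2}) + 1}{L_{N/2}^{(i)}(y_1^{N/2}, \hat{u}_{1,o}^{2i-2} \oplus \hat{u}_{1,e}^{2i-2}) + L_{N/2}^{(i)}(y_{N/2+1}^{N}, \hat{u}_{1,e}^{2i-2})} \qquad (74)$$

and

$$L_N^{(2i)}(y_1^N, \hat{u}_1^{2i-1}) = \left[ L_{N/2}^{(i)}(y_1^{N/2}, \hat{u}_{1,o}^{2i-2} \oplus \hat{u}_{1,e}^{2i-2}) \right]^{1 - 2\hat{u}_{2i-1}} \cdot L_{N/2}^{(i)}(y_{N/2+1}^{N}, \hat{u}_{1,e}^{2i-2}). \qquad (75)$$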



Thus, the calculation of an LR at length $N$ is reduced to the
calculation of two LRs at length $N/2$. This recursion can be
continued down to block-length 1, at which point the LRs have
the form $L_1^{(1)}(y_i) = W(y_i|0)/W(y_i|1)$ and can be computed
directly.
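The recursion just described translates almost line by line into code. The sketch below is a direct, unshared evaluation of a single LR; the function name and argument layout are assumptions made for illustration, and the channel-level LRs are assumed to be finite and strictly positive (as for a BSC with crossover probability strictly between 0 and 1), so that no division by zero occurs.

```python
def lr(y_lr, u_prev, i):
    """Likelihood ratio for bit i (1-based) at length N = len(y_lr).

    y_lr   : channel-level LRs W(y_j|0) / W(y_j|1), N a power of two
    u_prev : the i-1 earlier decisions, in natural order
    """
    n_len = len(y_lr)
    if n_len == 1:
        return y_lr[0]
    half = n_len // 2
    j = (i + 1) // 2                      # index of the two LRs at length N/2
    pairs = 2 * (j - 1)                   # number of earlier bits used below
    odd = u_prev[0:pairs:2]               # odd-indexed decisions (1-based)
    even = u_prev[1:pairs:2]              # even-indexed decisions (1-based)
    a = lr(y_lr[:half], [o ^ e for o, e in zip(odd, even)], j)
    b = lr(y_lr[half:], list(even), j)
    if i % 2 == 1:                        # odd index: the "(a b + 1)/(a + b)" rule
        return (a * b + 1.0) / (a + b)
    # even index: a^(1 - 2 u) * b, with u the decision on the preceding odd bit
    return b / a if u_prev[i - 2] else a * b


if __name__ == "__main__":
    print(lr([2.0, 3.0, 4.0, 5.0], [], 1))
    print(lr([2.0, 3.0, 4.0, 5.0], [0, 1], 3))
```

Each call spawns two calls at half the length, which is the source of the $O(N)$ cost per LR derived next.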

To estimate the complexity of LR calculations, let $\chi_L(k)$,
$k \in \{N, N/2, N/4, \ldots, 1\}$, denote the worst-case complexity
of computing $L_k^{(i)}(y_1^k, v_1^{i-1})$ over $i \in [1, k]$ and
$(y_1^k, v_1^{i-1}) \in \mathcal{Y}^k \times \mathcal{X}^{i-1}$. From the recursive LR formulas, we have the
complexity bound

$$\chi_L(k) \le 2\chi_L(k/2) + \alpha \qquad (76)$$

where $\alpha$ is the worst-case complexity of assembling two LRs
at length $k/2$ into an LR at length $k$. Taking $\chi_L(1)$ as 1 unit,
we obtain the bound

$$\chi_L(N) \le (1 + \alpha) N = O(N). \qquad (77)$$



The overall decoder complexity can now be bounded as
$\chi_D(N) \le K \chi_L(N) \le N \chi_L(N) = O(N^2)$. This complexity
corresponds to a decoder whose DEs do their LR calculations
privately, without sharing any partial results with each other.
It turns out that, if the DEs pool their scratch-pad results, a
more efficient decoder implementation is possible with overall
complexity $O(N \log N)$, as we will show next.

B. Refinement of the decoding algorithm 

We now consider a decoder that computes the full set of
LRs, $\{L_N^{(i)}(y_1^N, \hat{u}_1^{i-1}) : 1 \le i \le N\}$. The previous decoder
could skip the calculation of $L_N^{(i)}(y_1^N, \hat{u}_1^{i-1})$ for $i \in \mathcal{A}^c$; but
now we do not allow this. The decisions $\{\hat{u}_i : 1 \le i \le N\}$
are made in exactly the same manner as before; in particular,
if $i \in \mathcal{A}^c$, the decision $\hat{u}_i$ is set to the known frozen value $u_i$,
regardless of $L_N^{(i)}(y_1^N, \hat{u}_1^{i-1})$.

To see where the computational savings will come from, we
inspect (74) and (75) and note that each LR value in the pair

$$\left( L_N^{(2i-1)}(y_1^N, \hat{u}_1^{2i-2}),\; L_N^{(2i)}(y_1^N, \hat{u}_1^{2i-1}) \right)$$

is assembled from the same pair of LRs:

$$\left( L_{N/2}^{(i)}(y_1^{N/2}, \hat{u}_{1,o}^{2i-2} \oplus \hat{u}_{1,e}^{2i-2}),\; L_{N/2}^{(i)}(y_{N/2+1}^{N}, \hat{u}_{1,e}^{2i-2}) \right).$$

Thus, the calculation of all $N$ LRs at length $N$ requires exactly
$N$ LR calculations at length $N/2$ (see the footnote on duplications).
Let us split the $N$ LRs at length $N/2$ into two classes, namely,

$$\left\{ L_{N/2}^{(i)}(y_1^{N/2}, \hat{u}_{1,o}^{2i-2} \oplus \hat{u}_{1,e}^{2i-2}) : 1 \le i \le N/2 \right\},$$
$$\left\{ L_{N/2}^{(i)}(y_{N/2+1}^{N}, \hat{u}_{1,e}^{2i-2}) : 1 \le i \le N/2 \right\}. \qquad (78)$$



Let us suppose that we carry out the calculations in each class
independently, without trying to exploit any further savings
that may come from the sharing of LR values between the
two classes. Then, we have two problems of the same type as
the original but at half the size. Each class in (78) generates a
set of $N/2$ LR calculation requests at length $N/4$, for a total
of $N$ requests. For example, if we let $\hat{v}_1^{N/2} \triangleq \hat{u}_{1,o}^{N} \oplus \hat{u}_{1,e}^{N}$,
the requests arising from the first class are

$$\left\{ L_{N/4}^{(i)}(y_1^{N/4}, \hat{v}_{1,o}^{2i-2} \oplus \hat{v}_{1,e}^{2i-2}) : 1 \le i \le N/4 \right\},$$
$$\left\{ L_{N/4}^{(i)}(y_{N/4+1}^{N/2}, \hat{v}_{1,e}^{2i-2}) : 1 \le i \le N/4 \right\}.$$






Using this reasoning inductively across the set of all lengths
$\{N, N/2, \ldots, 1\}$, we conclude that the total number of LRs
that need to be calculated is $N(1 + \log N)$.
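One way to realize this sharing in software is to let each length-$k$ block be a coroutine that emits its LRs one at a time and is told each decision before producing the next LR. The sketch below uses Python generators for this; it is an assumed scheduling chosen for brevity, not the node-numbered depth-first schedule of Fig. 10, and the names `lr_stream`, `sc_decode`, and `encode` are hypothetical. Channel-level LRs are assumed finite and strictly positive.

```python
def encode(u):
    """x = u G_N via x = ((u_odd xor u_even) G_{N/2}, u_even G_{N/2})."""
    if len(u) == 1:
        return list(u)
    odd, even = u[0::2], u[1::2]
    return encode([o ^ e for o, e in zip(odd, even)]) + encode(list(even))


def lr_stream(y_lr):
    """Yield the LRs for bits 1..N in order; each decision arrives via send()."""
    n_len = len(y_lr)
    if n_len == 1:
        yield y_lr[0]
        return
    half = n_len // 2
    top, bot = lr_stream(y_lr[:half]), lr_stream(y_lr[half:])
    a, b = next(top), next(bot)           # the shared pair of half-length LRs
    for j in range(half):
        u_odd = yield (a * b + 1.0) / (a + b)          # odd-indexed bit
        u_even = yield (b / a if u_odd else a * b)     # even-indexed bit
        if j < half - 1:
            a = top.send(u_odd ^ u_even)  # upper half sees u_odd xor u_even
            b = bot.send(u_even)          # lower half sees u_even


def sc_decode(y_lr, frozen):
    """SC decoding; frozen maps 0-based indices to their known values."""
    stream = lr_stream(y_lr)
    ratio = next(stream)
    u_hat = []
    for i in range(len(y_lr)):
        u = frozen[i] if i in frozen else (0 if ratio >= 1 else 1)
        u_hat.append(u)
        if i < len(y_lr) - 1:
            ratio = stream.send(u)
    return u_hat


if __name__ == "__main__":
    u = [0, 0, 1, 0, 1, 1, 0, 1]          # information set {3, 5, 6, 7, 8}
    y_lr = [9.0 if b == 0 else 1.0 / 9.0 for b in encode(u)]
    print(sc_decode(y_lr, {0: 0, 1: 0, 3: 0}))
```

Every generator at every level yields exactly one value per bit of its block, so the number of LR evaluations is $N$ per level over $1 + \log N$ levels, matching the count above.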

So far, we have not paid attention to the exact order in which
the LR calculations at various block-lengths are carried out.
Although this gave us an accurate count of the total number
of LR calculations, for a full description of the algorithm, we
need to specify an order. There are many possibilities for such
an order, but to be specific we will use a depth-first algorithm,
which is easily described by a small example.

We consider a decoder for a code with parameter
$(N, K, \mathcal{A}, u_{\mathcal{A}^c})$ chosen as $(8, 5, \{3, 5, 6, 7, 8\}, (0, 0, 0))$. The
computation for the decoder is laid out in a graph as shown in
Fig. 10. There are $N(1 + \log N) = 32$ nodes in the graph, each
responsible for computing an LR request that arises during
the course of the algorithm. Starting from the left side, the
first column of nodes corresponds to LR requests at length
8 (decision level), the second column of nodes to requests
at length 4, the third at length 2, and the fourth at length 1
(channel level).

Each node in the graph carries two labels. For example, the
third node from the bottom in the third column has the labels
$(y_5^6, \hat{u}_2 \oplus \hat{u}_4)$ and 26; the first label indicates that the LR value
to be calculated at this node is $L_2^{(2)}(y_5^6, \hat{u}_2 \oplus \hat{u}_4)$ while the
second label indicates that this node will be the 26th node to
be activated. The numeric labels, 1 through 32, will be used
as quick identifiers in referring to nodes in the graph.

Footnote: Actually, some LR calculations at length $N/2$ may be avoided if, by
chance, some duplications occur, but we will disregard this.



The decoder is visualized as consisting of $N$ DEs situated at
the left-most side of the decoder graph. The node with label
$(y_1^8, \hat{u}_1^{i-1})$ is associated with the $i$th DE, $1 \le i \le 8$. The
positioning of the DEs in the left-most column follows the
bit-reversed index order, as in Fig. 9.

Fig. 10. An implementation of the successive cancellation decoder for polar
coding at block-length $N = 8$.



Decoding begins with DE 1 activating node 1 for the
calculation of $L_8^{(1)}(y_1^8)$. Node 1 in turn activates node 2 for
$L_4^{(1)}(y_1^4)$. At this point, program control passes to node 2,
and node 1 will wait until node 2 delivers the requested
LR. The process continues. Node 2 activates node 3, which
activates node 4. Node 4 is a node at the channel level; so it
computes $L_1^{(1)}(y_1)$ and passes it to nodes 3 and 23, its left-side
neighbors. In general a node will send its computational
result to all its left-side neighbors (although this will not be
stated explicitly below). Program control will be passed back
to the left neighbor from which it was received.

Node 3 still needs data from the right side and activates node
5, which delivers $L_1^{(1)}(y_2)$. Node 3 assembles $L_2^{(1)}(y_1^2)$ from
the messages it has received from nodes 4 and 5 and sends it to
node 2. Next, node 2 activates node 6, which activates nodes
7 and 8, and returns its result to node 2. Node 2 compiles its
response $L_4^{(1)}(y_1^4)$ and sends it to node 1. Node 1 activates
node 9 which calculates $L_4^{(1)}(y_5^8)$ in the same manner as node
2 calculated $L_4^{(1)}(y_1^4)$, and returns the result to node 1. Node
1 now assembles $L_8^{(1)}(y_1^8)$ and sends it to DE 1. Since $u_1$ is a
frozen node, DE 1 ignores the received LR, declares $\hat{u}_1 = 0$,
and passes control to DE 2, located next to node 16.



DE 2 activates node 16 for $L_8^{(2)}(y_1^8, \hat{u}_1)$. Node 16 assembles
$L_8^{(2)}(y_1^8, \hat{u}_1)$ from the already-received LRs $L_4^{(1)}(y_1^4)$ and
$L_4^{(1)}(y_5^8)$, and returns its response without activating any node.
DE 2 ignores the returned LR since $u_2$ is frozen, announces
$\hat{u}_2 = 0$, and passes control to DE 3.

DE 3 activates node 17 for $L_8^{(3)}(y_1^8, \hat{u}_1^2)$. This triggers LR
requests at nodes 18 and 19, but no further. The bit $u_3$ is
not frozen; so, the decision $\hat{u}_3$ is made in accordance with
$L_8^{(3)}(y_1^8, \hat{u}_1^2)$, and control is passed to DE 4. DE 4 activates
node 20 for $L_8^{(4)}(y_1^8, \hat{u}_1^3)$, which is readily assembled and
returned. The algorithm continues in this manner until finally
DE 8 receives $L_8^{(8)}(y_1^8, \hat{u}_1^7)$ and decides $\hat{u}_8$.

There are a number of observations that can be made by
looking at this example that should provide further insight into
the general decoding algorithm. First, notice that the computation
of $L_8^{(1)}(y_1^8)$ is carried out in a subtree rooted at node
1, consisting of paths going from left to right, and spanning
all nodes at the channel level. This subtree splits into two
disjoint subtrees, namely, the subtree rooted at node 2 for the
calculation of $L_4^{(1)}(y_1^4)$ and the subtree rooted at node 9 for the
calculation of $L_4^{(1)}(y_5^8)$. Since the two subtrees are disjoint, the
corresponding calculations can be carried out independently
(even in parallel if there are multiple processors). This splitting
of computational subtrees into disjoint subtrees holds for all
nodes in the graph (except those at the channel level), making
it possible to implement the decoder with a high degree of
parallelism.

Second, we notice that the decoder graph consists of butterflies
(2-by-2 complete bipartite graphs) that tie together
adjacent levels of the graph. For example, nodes 9, 19, 10,
and 13 form a butterfly. The computational subtrees rooted
at nodes 9 and 19 split into a single pair of computational
subtrees, one rooted at node 10, the other at node 13. Also
note that among the four nodes of a butterfly, the upper-left
node is always the first node to be activated by the above
depth-first algorithm and the lower-left node always the last
one. The upper-right and lower-right nodes are activated by
the upper-left node and they may be activated in any order or
even in parallel. The algorithm we specified always activated
the upper-right node first, but this choice was arbitrary. When
the lower-left node is activated, it finds the LRs from its right
neighbors ready for assembly. The upper-left node assembles
the LRs it receives from the right side as in formula (74),
the lower-left node as in (75). These formulas show that the
butterfly patterns impose a constraint on the completion time
of LR calculations: in any given butterfly, the lower-left node
needs to wait for the result of the upper-left node which in
turn needs to wait for the results of the right-side nodes.

Variants of the decoder are possible in which the nodal 
computations are scheduled differently. In the "left-to-right" 
implementation given above, nodes waited to be activated. 
However, it is possible to have a "right-to-left" implementation 
in which each node starts its computation autonomously as 
soon as its right-side neighbors finish their calculations; this 
allows exploiting parallelism in computations to the maximum 
possible extent. 

For example, in such a fully-parallel implementation for
the case in Fig. 10, all eight nodes at the channel-level start
calculating their respective LRs in the first time slot following
the availability of the channel output vector $y_1^8$. In the second
time slot, nodes 3, 6, 10, and 13 do their LR calculations in
parallel. Note that this is the maximum degree of parallelism
possible in the second time slot. Node 23, for example, cannot
calculate $L_2^{(2)}(y_1^2, \hat{u}_1 \oplus \hat{u}_2 \oplus \hat{u}_3 \oplus \hat{u}_4)$ in this slot, because
$\hat{u}_1 \oplus \hat{u}_2 \oplus \hat{u}_3 \oplus \hat{u}_4$ is not yet available; it has to wait until decisions
$\hat{u}_1, \hat{u}_2, \hat{u}_3, \hat{u}_4$ are announced by the corresponding DEs. In
the third time slot, nodes 2 and 9 do their calculations. In time
slot 4, the first decision $\hat{u}_1$ is made at node 1 and broadcast
to all nodes across the graph (or at least to those that need it).
In slot 5, node 16 calculates $\hat{u}_2$ and broadcasts it. In slot 6,
nodes 18 and 19 do their calculations. This process continues
until time slot 15 when node 32 decides $\hat{u}_8$. It can be shown
that, in general, this fully-parallel decoder implementation has
a latency of $2N - 1$ time slots for a code of block-length $N$.



IX. Code construction 

The input to a polar code construction algorithm is a triple
$(W, N, K)$ where $W$ is the B-DMC on which the code will be
used, $N$ is the code block-length, and $K$ is the dimensionality
of the code. The output of the algorithm is an information set
$\mathcal{A} \subset \{1, \ldots, N\}$ of size $K$ such that $\sum_{i \in \mathcal{A}} Z(W_N^{(i)})$ is as
small as possible. We exclude the search for a good frozen
vector $u_{\mathcal{A}^c}$ from the code construction problem because the
problem is already difficult enough. Recall that, for symmetric
channels, the code performance is not affected by the choice
of $u_{\mathcal{A}^c}$.

In principle, the code construction problem can be solved
by computing all the parameters $\{Z(W_N^{(i)}) : 1 \le i \le N\}$
and sorting them; unfortunately, we do not have an efficient
algorithm for doing this. For symmetric channels, some computational
shortcuts are available, as we showed by Prop. 15,
but these shortcuts have not yielded an efficient algorithm,
either. One exception to all this is the BEC for which the
parameters $\{Z(W_N^{(i)})\}$ can all be calculated in time $O(N)$
thanks to the recursive formulas that hold for the BEC.
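For the BEC case, the construction can be sketched as follows. The recursion used here, $Z(W_{2N}^{(2i-1)}) = 2Z(W_N^{(i)}) - Z(W_N^{(i)})^2$ and $Z(W_{2N}^{(2i)}) = Z(W_N^{(i)})^2$ with $Z(W) = \epsilon$, is stated as an assumption of the sketch (it is the BEC relation for $I$ rewritten with $Z = 1 - I$); the function names are hypothetical.

```python
def bec_bhattacharyya(eps, n):
    """Z parameters of the N = 2**n synthesized channels for a BEC(eps).

    Returned in natural index order i = 1..N (list position i - 1).
    The total work is 2 + 4 + ... + N = 2N - 2 updates, i.e. O(N).
    """
    z = [eps]
    for _ in range(n):
        nxt = []
        for v in z:
            nxt.append(2 * v - v * v)     # odd-indexed child  (2i - 1)
            nxt.append(v * v)             # even-indexed child (2i)
        z = nxt
    return z


def bec_information_set(eps, n, k):
    """Indices (1-based) of the k channels with the smallest Z."""
    z = bec_bhattacharyya(eps, n)
    order = sorted(range(len(z)), key=lambda i: z[i])
    return sorted(i + 1 for i in order[:k])


if __name__ == "__main__":
    print(bec_bhattacharyya(0.5, 2))
    print(bec_information_set(0.5, 3, 4))
```

The sort adds an $O(N \log N)$ term; it can be replaced by a linear-time selection if only the set, and not the ranking, is needed.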

Since exact code construction appears too complex, it makes
sense to look for approximate constructions based on estimates
of the parameters $\{Z(W_N^{(i)})\}$. To that end, it is preferable
to pose the exact code construction problem as a decision
problem: Given a threshold $\gamma \in [0, 1]$ and an index
$i \in \{1, \ldots, N\}$, decide whether $i \in \mathcal{A}_\gamma$ where

$$\mathcal{A}_\gamma \triangleq \{ i \in \{1, \ldots, N\} : Z(W_N^{(i)}) < \gamma \}.$$

Any algorithm for solving this decision problem can be used
to solve the code construction problem. We can simply run
the algorithm with various settings for $\gamma$ until we obtain an
information set $\mathcal{A}_\gamma$ of the desired size $K$.

Approximate code construction algorithms can be proposed
based on statistically reliable and efficient methods for
estimating whether $i \in \mathcal{A}_\gamma$ for any given pair $(i, \gamma)$. The
estimation problem can be approached by noting that, as we
have implicitly shown in (53), the parameter $Z(W_N^{(i)})$ is the
expectation of the RV

$$\sqrt{\frac{W_N^{(i)}(Y_1^N, U_1^{i-1} \mid U_i \oplus 1)}{W_N^{(i)}(Y_1^N, U_1^{i-1} \mid U_i)}} \qquad (79)$$

where $(U_1^N, Y_1^N)$ is sampled from the joint probability
assignment $P_{U_1^N, Y_1^N}(u_1^N, y_1^N) \triangleq 2^{-N} W_N(y_1^N | u_1^N)$. A Monte-Carlo
approach can be taken where samples of $(U_1^N, Y_1^N)$
are generated from the given distribution and the empirical
means $\{\hat{Z}(W_N^{(i)})\}$ are calculated. Given a sample $(u_1^N, y_1^N)$
of $(U_1^N, Y_1^N)$, the sample values of the RVs (79) can all be
computed in complexity $O(N \log N)$. A SC decoder may be
used for this computation since the sample values of (79) are
just the square-roots of the decision statistics that the DEs in
a SC decoder ordinarily compute. (In applying a SC decoder
for this task, the information set $\mathcal{A}$ should be taken as the null
set.)
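A minimal sketch of this Monte-Carlo estimator is given below for a BSC. The choice of a BSC with crossover probability `p`, the helper names, and the use of Python's `random` module are all assumptions made for illustration. The true bits are fed back to the LR recursion, which corresponds to running the SC decoder with an empty information set.

```python
import math
import random


def encode(u):
    """x = u G_N via x = ((u_odd xor u_even) G_{N/2}, u_even G_{N/2})."""
    if len(u) == 1:
        return list(u)
    odd, even = u[0::2], u[1::2]
    return encode([o ^ e for o, e in zip(odd, even)]) + encode(list(even))


def lr_stream(y_lr):
    """Yield the LRs for bits 1..N in order; each bit value arrives via send()."""
    if len(y_lr) == 1:
        yield y_lr[0]
        return
    half = len(y_lr) // 2
    top, bot = lr_stream(y_lr[:half]), lr_stream(y_lr[half:])
    a, b = next(top), next(bot)
    for j in range(half):
        u_odd = yield (a * b + 1.0) / (a + b)
        u_even = yield (b / a if u_odd else a * b)
        if j < half - 1:
            a, b = top.send(u_odd ^ u_even), bot.send(u_even)


def estimate_z(p, n, samples, seed=0):
    """Empirical means of the square-root likelihood-ratio RV for a BSC(p)."""
    rng = random.Random(seed)
    size = 2 ** n
    acc = [0.0] * size
    for _ in range(samples):
        u = [rng.randrange(2) for _ in range(size)]
        y = [b ^ (rng.random() < p) for b in encode(u)]
        y_lr = [(1 - p) / p if b == 0 else p / (1 - p) for b in y]
        stream = lr_stream(y_lr)
        ratio = next(stream)
        for i in range(size):
            # sqrt(W(.|u_i xor 1) / W(.|u_i)): invert the LR when u_i = 0
            acc[i] += math.sqrt(1.0 / ratio if u[i] == 0 else ratio)
            if i < size - 1:
                ratio = stream.send(u[i])
    return [a / samples for a in acc]


if __name__ == "__main__":
    print(estimate_z(0.1, 2, 5000))
```

For $N = 2$ the second estimate should approach $Z(W)^2$, which gives a quick sanity check on the sampler.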

Statistical algorithms are helped by the polarization
phenomenon: for any fixed $\gamma$ and as $N$ grows, it becomes easier
to resolve whether $Z(W_N^{(i)}) < \gamma$ because an ever growing
fraction of the parameters $\{Z(W_N^{(i)})\}$ tend to cluster around
0 or 1.

It is conceivable that, in an operational system, the
estimation of the parameters $\{Z(W_N^{(i)})\}$ is made part of a SC
decoding procedure, with continual update of the information
set as more reliable estimates become available.



X. A note on the RM rule

In this part, we return to the claim made in Sect. I-D that the
RM rule for information set selection leads to asymptotically
unreliable codes under SC decoding.

Recall that, for a given $(N, K)$, the RM rule constructs a
$G_N$-coset code with parameter $(N, K, \mathcal{A}, u_{\mathcal{A}^c})$ by prioritizing
each index $i \in \{1, \ldots, N\}$ for inclusion in the information set
$\mathcal{A}$ w.r.t. the Hamming weight of the $i$th row of $G_N$. The RM
rule sets the frozen bits $u_{\mathcal{A}^c}$ to zero. In light of Prop. 17, the
RM rule can be restated in bit-indexed terminology as follows.

RM rule: For a given $(N, K)$, with $N = 2^n$, $n \ge 0$,
$0 \le K \le N$, choose $\mathcal{A}$ as follows: (i) Determine the integer $r$ such
that

$$\sum_{k=r}^{n} \binom{n}{k} \le K < \sum_{k=r-1}^{n} \binom{n}{k}. \qquad (80)$$

(ii) Put each index $b_1 \cdots b_n$ with $w_H(b_1, \ldots, b_n) \ge r$ into
$\mathcal{A}$. (iii) Put sufficiently many additional indices $b_1 \cdots b_n$ with
$w_H(b_1, \ldots, b_n) = r - 1$ into $\mathcal{A}$ to complete its size to $K$.

We observe that this rule will select the index

$$0^{n-r} 1^{r} \triangleq \underbrace{0 \cdots 0}_{n-r}\, \underbrace{1 \cdots 1}_{r}$$

for inclusion in $\mathcal{A}$. This index turns out to be a particularly
poor choice, at least for the class of BECs, as we show in the
remaining part of this section.

Let us assume that the code constructed by the RM rule is
used on a BEC $W$ with some erasure probability $\epsilon > 0$. We
will show that the symmetric capacity $I(W_{0^{n-r}1^{r}})$ converges
to zero for any fixed positive coding rate as the block-length
is increased. For this, we recall the relations (6), which, in
bit-indexed channel notation of Sect. IV, can be written as
follows. For any $\ell \ge 1$, $b_1, \ldots, b_\ell \in \{0, 1\}$,

$$I(W_{b_1 \cdots b_\ell 0}) = I(W_{b_1 \cdots b_\ell})^2,$$
$$I(W_{b_1 \cdots b_\ell 1}) = 2 I(W_{b_1 \cdots b_\ell}) - I(W_{b_1 \cdots b_\ell})^2 \le 2 I(W_{b_1 \cdots b_\ell})$$

with initial values $I(W_0) = I(W)^2$ and $I(W_1) = 2I(W) - I(W)^2$.
These give the bound

$$I(W_{0^{n-r}1^{r}}) \le 2^{r} (1 - \epsilon)^{2^{n-r}}. \qquad (81)$$



Now, consider a sequence of RM codes with a fixed rate
$0 < R < 1$, $N$ increasing to infinity, and $K = \lfloor NR \rfloor$. Let $r(N)$
denote the parameter $r$ in (80) for the code with block-length
$N$ in this sequence. Let $n = \log_2(N)$. A simple asymptotic
analysis shows that the ratio $r(N)/n$ must go to 1/2 as $N$ is
increased. This in turn implies by (81) that $I(W_{0^{n-r}1^{r}})$ must
go to zero.
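The argument can be illustrated numerically. The sketch below is an assumption-laden example: it computes $r$ from (80), evaluates $I(W_{0^{n-r}1^{r}})$ exactly for a BEC by the bit-indexed recursion, and compares it with the bound (81). The function names are hypothetical, and `math.comb` requires Python 3.8 or later.

```python
from math import comb


def rm_parameter_r(n, k):
    """Integer r with sum_{j>=r} C(n,j) <= k < sum_{j>=r-1} C(n,j), 0 <= k < 2**n."""
    r = n + 1
    tail = 0                               # tail = sum_{j=r}^{n} C(n, j)
    while r > 0 and tail + comb(n, r - 1) <= k:
        r -= 1
        tail += comb(n, r)
    return r


def bec_capacity_of_index(eps, bits):
    """I(W_{b_1...b_n}) for a BEC(eps): square on a 0, 2I - I^2 on a 1."""
    cap = 1.0 - eps
    for b in bits:
        cap = 2 * cap - cap * cap if b else cap * cap
    return cap


def rm_worst_index_capacity(eps, n, rate):
    """Return (r, exact capacity, upper bound) for the index 0^(n-r) 1^r."""
    r = rm_parameter_r(n, int(rate * 2 ** n))
    exact = bec_capacity_of_index(eps, [0] * (n - r) + [1] * r)
    bound = 2 ** r * (1 - eps) ** (2 ** (n - r))
    return r, exact, bound


if __name__ == "__main__":
    for n in (6, 10, 14, 18):
        print(n, rm_worst_index_capacity(0.5, n, 0.5))
```

As $n$ grows at a fixed rate, both the exact value and the bound shrink rapidly, even though the RM rule still places this index in the information set.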

Suppose that this sequence of RM codes is decoded using a
SC decoder as in Sect. I-C.2 where the decision metric ignores
knowledge of frozen bits and instead uses randomization over
all possible choices. Then, as $N$ goes to infinity, the SC
decoder decision element with index $0^{n-r}1^{r}$ sees a channel
whose capacity goes to zero, while the corresponding element
of the input vector $u_1^N$ is assigned 1 bit of information by
the RM rule. This means that the RM code sequence is
asymptotically unreliable under this type of SC decoding.

We should emphasize that the above result does not say 
that RM codes are asymptotically bad under any SC decoder, 
nor does it make a claim about the performance of RM 
codes under other decoding algorithms. (It is interesting that 
the possibility of RM codes being capacity-achieving codes 
under ML decoding seems to have received no attention in 
the literature.) 

XI. Concluding remarks 

In this section, we go through the paper to discuss some 
results further, point out some generalizations, and state some 
open problems. 

A. Rate of polarization 

A major open problem suggested by this paper is to determine
how fast a channel polarizes as a function of the block-length
parameter $N$. In recent work [12], the following result
has been obtained in this direction.

Proposition 18: Let $W$ be a B-DMC. For any fixed rate
$R < I(W)$ and constant $\beta < \frac{1}{2}$, there exists a sequence of
sets $\{\mathcal{A}_N\}$ such that $\mathcal{A}_N \subset \{1, \ldots, N\}$, $|\mathcal{A}_N| \ge NR$, and

$$\sum_{i \in \mathcal{A}_N} Z(W_N^{(i)}) = o(2^{-N^{\beta}}). \qquad (82)$$

Conversely, if $R > 0$ and $\beta > \frac{1}{2}$, then for any sequence of
sets $\{\mathcal{A}_N\}$ with $\mathcal{A}_N \subset \{1, \ldots, N\}$, $|\mathcal{A}_N| \ge NR$, we have

$$\max\{ Z(W_N^{(i)}) : i \in \mathcal{A}_N \} = \omega(2^{-N^{\beta}}). \qquad (83)$$

As a corollary, Theorem 3 is strengthened as follows.

Proposition 19: For polar coding on a B-DMC $W$ at any
fixed rate $R < I(W)$, and any fixed $\beta < \frac{1}{2}$,

$$P_e(N, R) = o(2^{-N^{\beta}}). \qquad (84)$$

This is a vast improvement over the $O(N^{-1/4})$ bound proved
in this paper. Note that the bound still does not depend on
the rate $R$ as long as $R < I(W)$. A problem of theoretical
interest is to obtain sharper bounds on $P_e(N, R)$ that show a
more explicit dependence on $R$.

Another problem of interest related to polarization is ro- 
bustness against channel parameter variations. A finding in 
this regard is the following result [13]: If a polar code is 
designed for a B-DMC W but used on some other B-DMC 
W', then the code will perform at least as well as it would 
perform on W provided W is a degraded version of W' in 
the sense of Shannon [14]. This result gives reason to expect a 
graceful degradation of polar-coding performance due to errors 
in channel modeling. 

B. Generalizations 
Fig. 11. General form of channel combining.

The polarization scheme considered in this paper can be
generalized as shown in Fig. 11. In this general form, the
channel input alphabet is assumed $q$-ary, $\mathcal{X} = \{0, 1, \ldots, q - 1\}$,
for some $q \ge 2$. The construction begins by combining $m$
independent copies of a DMC $W : \mathcal{X} \to \mathcal{Y}$ to obtain $W_m$,
where $m \ge 2$ is a fixed parameter of the construction. The
general step combines $m$ independent copies of the channel
$W_{N/m}$ from the previous step to obtain $W_N$. In general,
the size of the construction is $N = m^n$ after $n$ steps. The
construction is characterized by a kernel $F_m : \mathcal{X}^m \times \mathcal{R} \to \mathcal{X}^m$
where $\mathcal{R}$ is some finite set included in the mapping for
randomization. The reason for introducing randomization will
be discussed shortly.



The vectors $u_1^N \in \mathcal{X}^N$ and $y_1^N \in \mathcal{Y}^N$ in Fig. 11 denote
the input and output vectors of $W_N$. The input vector is first
transformed into a vector $s_1^N \in \mathcal{X}^N$ by breaking it into $N/m$
consecutive sub-blocks of length $m$, namely, $u_1^m, \ldots, u_{N-m+1}^N$,
and passing each sub-block through the transform $F_m$. Then,
a permutation $R_N$ sorts the components of $s_1^N$ w.r.t. mod-$m$
residue classes of their indices. The sorter ensures that, for any
$1 \le k \le m$, the $k$th copy of $W_{N/m}$, counting from the top of
the figure, gets as input those components of $s_1^N$ whose indices
are congruent to $k$ mod-$m$. For example, $v_1 = s_1$, $v_2 = s_{m+1}$,
$v_{N/m} = s_{(N/m-1)m+1}$, $v_{N/m+1} = s_2$, $v_{N/m+2} = s_{m+2}$, and
so on. The general formula is $v_{kN/m+j} = s_{k+(j-1)m+1}$ for
all $0 \le k \le (m - 1)$, $1 \le j \le N/m$.
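The sorter is simple to express in code. The following sketch applies the general formula with 0-based list positions; the function name is hypothetical.

```python
def sorter_permutation(s, m):
    """Apply the mod-m sorter: v[k*N/m + j] = s[k + (j-1)*m + 1] (1-based).

    The k-th output block of length N/m collects the components of s whose
    1-based index is congruent to k + 1 modulo m.
    """
    total = len(s)
    if total % m != 0:
        raise ValueError("length of s must be a multiple of m")
    block = total // m
    v = [None] * total
    for k in range(m):
        for j in range(1, block + 1):
            # 0-based positions of v_{kN/m + j} and s_{k + (j-1)m + 1}
            v[k * block + j - 1] = s[k + (j - 1) * m]
    return v


if __name__ == "__main__":
    print(sorter_permutation(list(range(1, 10)), 3))
    print(sorter_permutation(list(range(1, 9)), 2))
```

With $m = 2$ this reduces to the reverse shuffle $R_N$ of the binary construction, which sends odd-indexed components to the first half and even-indexed components to the second half.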

We regard the randomization parameters $r_1, \ldots, r_N$ as being
chosen at random at the time of code construction, but fixed
throughout the operation of the system; the decoder operates
with full knowledge of them. For the binary case considered
in this paper, we did not employ any randomization. Here,
randomization has been introduced as part of the general
construction because preliminary studies show that it greatly
simplifies the analysis of generalized polarization schemes.
This subject will be explored further in future work.

Certain additional constraints need to be placed on the
kernel $F_m$ to ensure that a polar code can be defined that is
suitable for SC decoding in the natural order $u_1$ to $u_N$. To that
end, it is sufficient to restrict $F_m$ to unidirectional functions,
namely, invertible functions of the form
$F_m : (u_1^m, r) \in \mathcal{X}^m \times \mathcal{R} \mapsto x_1^m \in \mathcal{X}^m$ such that $x_i = f_i(u_i^m, r)$, for a
given set of coordinate functions $f_i : \mathcal{X}^{m-i+1} \times \mathcal{R} \to \mathcal{X}$,
$i = 1, \ldots, m$. For a unidirectional $F_m$, the combined channel
$W_N$ can be split to channels $\{W_N^{(i)}\}$ in much the same way
as in this paper. The encoding and SC decoding complexities
of such a code are both $O(N \log N)$.

Polar coding can be generalized further in order to overcome
the restriction of the block-length $N$ to powers of a given
number $m$ by using a sequence of kernels $F_{m_i}$, $i = 1, \ldots, n$,
in the code construction. Kernel $F_{m_1}$ combines $m_1$ copies
of a given DMC $W$ to create a channel $W_{m_1}$. Kernel $F_{m_2}$
combines $m_2$ copies of $W_{m_1}$ to create a channel $W_{m_1 m_2}$, etc.,
for an overall block-length of $N = \prod_{i=1}^{n} m_i$. If all kernels are
unidirectional, the combined channel $W_N$ can still be split into
channels $W_N^{(i)}$ whose transition probabilities can be expressed
by recursive formulas and $O(N \log N)$ encoding and decoding
complexities are maintained.

So far we have considered only combining copies of one
DMC $W$. Another direction for generalization of the method is
to combine copies of two or more distinct DMCs. For example,
the kernel $F$ considered in this paper can be used to combine
copies of any two B-DMCs $W$, $W'$. The investigation of
coding advantages that may result from such variations on the
basic code construction method is an area for further research.

It is easy to propose variants and generalizations of the
basic channel polarization scheme, as we did above; however,
it is not clear if we obtain channel polarization under each
such variant. We conjecture that channel polarization is a
common phenomenon, which is almost impossible to avoid as
long as channels are combined with a sufficient density and
mix of connections, whether chosen recursively or at random,
provided the coordinatewise splitting of the synthesized vector
channel is done according to a suitable SC decoding order.
The study of channel polarization in such generality is an
interesting theoretical problem.



C. Iterative decoding of polar codes 

We have seen that polar coding under SC decoding can
achieve symmetric channel capacity; however, one needs to
use codes with impractically large block lengths. A question
of interest is whether polar coding performance can improve
significantly under more powerful decoding algorithms. The
sparseness of the graph representation of $F^{\otimes n}$ makes
Gallager's belief propagation (BP) decoding algorithm [15]
applicable to polar codes. A highly relevant work in this connection
is [16] which proposes BP decoding for RM codes using a
factor-graph of $F^{\otimes n}$, as shown in Fig. 12 for $N = 8$. We
carried out experimental studies to assess the performance of
polar codes under BP decoding, using RM codes under BP
decoding as a benchmark [17]. The results showed significantly
better performance for polar codes. Also, the performance of
polar codes under BP decoding was significantly better than
their performance under SC decoding. However, more work
needs to be done to assess the potential of polar coding for
practical applications.
Fig. 12. The factor graph representation for the transformation $F^{\otimes 3}$.



Appendix 
A. Proof of Proposition 1

The right hand side of (1) equals the channel parameter
$E_0(1, Q)$ as defined in Gallager [10, Section 5.6] with $Q$ taken
as the uniform input distribution. (This is the symmetric cutoff
rate of the channel.) It is well known (and shown in the same
section of [10]) that $I(W) \ge E_0(1, Q)$. This proves (1).

To prove (2), for any B-DMC $W : \mathcal{X} \to \mathcal{Y}$, define

$$d(W) \triangleq \frac{1}{2} \sum_{y \in \mathcal{Y}} |W(y|0) - W(y|1)|.$$

This is the variational distance between the two distributions
$W(y|0)$ and $W(y|1)$ over $y \in \mathcal{Y}$.

Lemma 2: For any B-DMC $W$, $I(W) \le d(W)$.

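Proof: Let $W$ be an arbitrary B-DMC with output
alphabet $\mathcal{Y} = \{1, \ldots, n\}$ and put $P_i = W(i|0)$, $Q_i = W(i|1)$,
$i = 1, \ldots, n$. By definition,

$$I(W) = \sum_{i=1}^{n} \frac{1}{2} \left[ P_i \log \frac{P_i}{\frac{1}{2} P_i + \frac{1}{2} Q_i} + Q_i \log \frac{Q_i}{\frac{1}{2} P_i + \frac{1}{2} Q_i} \right].$$

The $i$th bracketed term under the summation is given by

$$f(x) \triangleq x \log \frac{x}{x + \delta} + (x + 2\delta) \log \frac{x + 2\delta}{x + \delta}$$

where $x = \min\{P_i, Q_i\}$ and $\delta = \frac{1}{2} |P_i - Q_i|$. We now consider
maximizing $f(x)$ over $0 \le x \le 1 - 2\delta$. We compute

$$\frac{df}{dx} = 2 \log \frac{\sqrt{x(x + 2\delta)}}{x + \delta}$$

and recognize that $\sqrt{x(x + 2\delta)}$ and $(x + \delta)$ are, respectively,
the geometric and arithmetic means of the numbers $x$ and
$(x + 2\delta)$. So, $df/dx \le 0$ and $f(x)$ is maximized at $x = 0$,
giving the inequality $f(x) \le 2\delta$. Using this in the expression
for $I(W)$, we obtain the claim of the lemma,

$$I(W) \le \sum_{i=1}^{n} \frac{1}{2} |P_i - Q_i| = d(W).$$

■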



Lemma 3: For any B-DMC $W$, $d(W) \le \sqrt{1 - Z(W)^2}$.

Proof: Let $W$ be an arbitrary B-DMC with output
alphabet $\mathcal{Y} = \{1, \ldots, n\}$ and put $P_i = W(i|0)$, $Q_i = W(i|1)$,
$i = 1, \ldots, n$. Let $\delta_i = \frac{1}{2} |P_i - Q_i|$, $\delta = d(W) = \sum_{i=1}^{n} \delta_i$,
and $R_i = (P_i + Q_i)/2$. Then, we have
$Z(W) = \sum_{i=1}^{n} \sqrt{(R_i - \delta_i)(R_i + \delta_i)}$. Clearly, $Z(W)$ is upper-bounded
by the maximum of $\sum_{i=1}^{n} \sqrt{R_i^2 - \delta_i^2}$ over $\{\delta_i\}$ subject to the
constraints that $0 \le \delta_i \le R_i$, $i = 1, \ldots, n$, and $\sum_{i=1}^{n} \delta_i = \delta$.
To carry out this maximization, we compute the partial
derivatives of $Z(W)$ with respect to $\delta_i$,

$$\frac{\partial Z}{\partial \delta_i} = -\frac{\delta_i}{\sqrt{R_i^2 - \delta_i^2}}, \qquad \frac{\partial^2 Z}{\partial \delta_i^2} = -\frac{R_i^2}{(R_i^2 - \delta_i^2)^{3/2}},$$

and observe that $Z(W)$ is a decreasing, concave function of
$\delta_i$ for each $i$, within the range $0 \le \delta_i \le R_i$. The maximum
occurs at the solution of the set of equations $\partial Z / \partial \delta_i = k$, all
$i$, where $k$ is a constant, i.e., at $\delta_i = R_i \sqrt{k^2/(1 + k^2)}$. Using
the constraint $\sum_i \delta_i = \delta$ and the fact that $\sum_{i=1}^{n} R_i = 1$, we
find $\sqrt{k^2/(1 + k^2)} = \delta$. So, the maximum occurs at $\delta_i = \delta R_i$
and has the value $\sum_{i=1}^{n} \sqrt{R_i^2 - \delta^2 R_i^2} = \sqrt{1 - \delta^2}$. We have
thus shown that $Z(W) \le \sqrt{1 - d(W)^2}$, which is equivalent
to $d(W) \le \sqrt{1 - Z(W)^2}$. ■

From the above two lemmas, the proof of (2) is immediate.
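The chain $I(W) \le d(W) \le \sqrt{1 - Z(W)^2}$ can be spot-checked numerically. The sketch below is illustrative only: the function name and the list-based channel representation are assumptions, logarithms are base 2, and the two example channels (a BSC and a BEC) are chosen because they meet one of the two inequalities with equality.

```python
import math


def channel_parameters(p, q):
    """Return (I, d, Z) for a B-DMC with p[i] = W(i|0) and q[i] = W(i|1)."""
    cap = 0.0
    for a, b in zip(p, q):
        mix = 0.5 * (a + b)
        if a > 0:
            cap += 0.5 * a * math.log2(a / mix)
        if b > 0:
            cap += 0.5 * b * math.log2(b / mix)
    dist = 0.5 * sum(abs(a - b) for a, b in zip(p, q))
    bhat = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return cap, dist, bhat


if __name__ == "__main__":
    examples = {
        "BSC(0.1)": ([0.9, 0.1], [0.1, 0.9]),
        "BEC(0.3)": ([0.7, 0.0, 0.3], [0.0, 0.7, 0.3]),
    }
    for name, (p, q) in examples.items():
        cap, dist, bhat = channel_parameters(p, q)
        print(name, cap, dist, math.sqrt(1 - bhat ** 2))
```

For the BSC the second inequality holds with equality, and for the BEC the first one does, so neither lemma can be improved in general.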



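B. Proof of Proposition 3

To prove (22), we write

$$W_{2N}^{(2i-1)}(y_1^{2N}, u_1^{2i-2} \mid u_{2i-1}) = \sum_{u_{2i}^{2N}} \frac{1}{2^{2N-1}} W_{2N}(y_1^{2N} \mid u_1^{2N})$$
$$= \sum_{u_{2i,o}^{2N},\, u_{2i,e}^{2N}} \frac{1}{2^{2N-1}} W_N(y_1^{N} \mid u_{1,o}^{2N} \oplus u_{1,e}^{2N})\, W_N(y_{N+1}^{2N} \mid u_{1,e}^{2N})$$
$$= \sum_{u_{2i}} \frac{1}{2} \sum_{u_{2i+1,e}^{2N}} \frac{1}{2^{N-1}} W_N(y_{N+1}^{2N} \mid u_{1,e}^{2N}) \cdot \sum_{u_{2i+1,o}^{2N}} \frac{1}{2^{N-1}} W_N(y_1^{N} \mid u_{1,o}^{2N} \oplus u_{1,e}^{2N}). \qquad (85)$$

By definition (5), the sum over $u_{2i+1,o}^{2N}$ for any fixed $u_{1,e}^{2N}$
equals

$$W_N^{(i)}(y_1^{N}, u_{1,o}^{2i-2} \oplus u_{1,e}^{2i-2} \mid u_{2i-1} \oplus u_{2i}),$$

because, as $u_{2i+1,o}^{2N}$ ranges over $\mathcal{X}^{N-i}$, $u_{2i+1,o}^{2N} \oplus u_{2i+1,e}^{2N}$
ranges also over $\mathcal{X}^{N-i}$. We now factor this term out of the
middle sum in (85) and use (5) again to obtain (22). For the
proof of (23), we write

$$W_{2N}^{(2i)}(y_1^{2N}, u_1^{2i-1} \mid u_{2i}) = \sum_{u_{2i+1}^{2N}} \frac{1}{2^{2N-1}} W_{2N}(y_1^{2N} \mid u_1^{2N})$$
$$= \frac{1}{2} \sum_{u_{2i+1,e}^{2N}} \frac{1}{2^{N-1}} W_N(y_{N+1}^{2N} \mid u_{1,e}^{2N}) \cdot \sum_{u_{2i+1,o}^{2N}} \frac{1}{2^{N-1}} W_N(y_1^{N} \mid u_{1,o}^{2N} \oplus u_{1,e}^{2N}).$$

By carrying out the inner and outer sums in the same manner
as in the proof of (22), we obtain (23).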

C. Proof of Proposition |4] 

Let us specify the channels as follows: W : X ^ y, W : 
X ^ Y, and W" : X ^ Y x X. By hypothesis there is a 
one-to-one function f : y ^ y such that ( fTTj l and ( fTST l are 
satisfied. For the proof it is helpful to define an ensemble of 
RVs {Ui,U2, Xi, X2,Yi,Y2,Y) so that the pair (C/i,C/2) is 
uniformly distributed over X'^, (Xi,X2) = (C/i © U2,U2), 
PYi,Y2\Xi,X2(.yi:y2\xi,X2) = Wiyi\xi)W{y2\x2), and Y = 
f{Yi,Y2). We now have 

W"{y,ui\u2) = PYu^iu^{y,ui\u2). 



From these and the fact that $(Y_1, Y_2) \mapsto \tilde{Y}$ is invertible, we
get

$$I(W') = I(U_1; \tilde{Y}) = I(U_1; Y_1 Y_2),$$
$$I(W'') = I(U_2; \tilde{Y} U_1) = I(U_2; Y_1 Y_2 U_1).$$

Since $U_1$ and $U_2$ are independent, $I(U_2; Y_1 Y_2 U_1)$ equals
$I(U_2; Y_1 Y_2 | U_1)$. So, by the chain rule, we have

$$I(W') + I(W'') = I(U_1 U_2; Y_1 Y_2) = I(X_1 X_2; Y_1 Y_2)$$

where the second equality is due to the one-to-one relationship
between $(X_1, X_2)$ and $(U_1, U_2)$. The proof of (24) is
completed by noting that $I(X_1 X_2; Y_1 Y_2)$ equals $I(X_1; Y_1) + I(X_2; Y_2)$,
which in turn equals $2I(W)$.
To prove (25), we begin by noting that

$$\begin{aligned}
I(W'') &= I(U_2; Y_1 Y_2 U_1) \\
&= I(U_2; Y_2) + I(U_2; Y_1 U_1 | Y_2) \\
&= I(W) + I(U_2; Y_1 U_1 | Y_2).
\end{aligned}$$

This shows that $I(W'') \ge I(W)$. This and (24) give
(25). The above proof shows that equality holds in (25) iff
$I(U_2; Y_1 U_1 | Y_2) = 0$, which is equivalent to having

$$P_{U_1,U_2,Y_1|Y_2}(u_1, u_2, y_1 | y_2) = P_{U_1,Y_1|Y_2}(u_1, y_1 | y_2) \cdot P_{U_2|Y_2}(u_2 | y_2)$$

for all $(u_1, u_2, y_1, y_2)$ such that $P_{Y_2}(y_2) > 0$, or equivalently,

$$P_{Y_1,Y_2|U_1,U_2}(y_1, y_2 | u_1, u_2)\, P_{Y_2}(y_2) = P_{Y_1,Y_2|U_1}(y_1, y_2 | u_1)\, P_{Y_2|U_2}(y_2 | u_2) \tag{86}$$

for all $(u_1, u_2, y_1, y_2)$. Since $P_{Y_1,Y_2|U_1,U_2}(y_1, y_2 | u_1, u_2) = W(y_1 | u_1 \oplus u_2) W(y_2 | u_2)$,
(86) can be written as

$$W(y_2 | u_2) \left[ W(y_1 | u_1 \oplus u_2) P_{Y_2}(y_2) - P_{Y_1,Y_2|U_1}(y_1, y_2 | u_1) \right] = 0. \tag{87}$$

Substituting $P_{Y_2}(y_2) = \frac{1}{2} W(y_2 | u_2) + \frac{1}{2} W(y_2 | u_2 \oplus 1)$ and

$$P_{Y_1,Y_2|U_1}(y_1, y_2 | u_1) = \frac{1}{2} W(y_1 | u_1 \oplus u_2) W(y_2 | u_2) + \frac{1}{2} W(y_1 | u_1 \oplus u_2 \oplus 1) W(y_2 | u_2 \oplus 1)$$

into (87) and simplifying, we obtain

$$W(y_2 | u_2) W(y_2 | u_2 \oplus 1) \left[ W(y_1 | u_1 \oplus u_2) - W(y_1 | u_1 \oplus u_2 \oplus 1) \right] = 0,$$

which for all four possible values of $(u_1, u_2)$ is equivalent to

$$W(y_2 | 0) W(y_2 | 1) \left[ W(y_1 | 0) - W(y_1 | 1) \right] = 0.$$

Thus, either there exists no $y_2$ such that $W(y_2|0) W(y_2|1) > 0$,
in which case $I(W) = 1$, or for all $y_1$ we have $W(y_1|0) = W(y_1|1)$,
which implies $I(W) = 0$.
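The conservation and ordering of the symmetric capacities can be checked numerically on a small example. This is only a sketch under stated assumptions: $f$ is taken to be the identity map $(y_1, y_2) \mapsto (y_1, y_2)$, channels are stored as two rows of probabilities, and the function names are illustrative rather than taken from the paper.

```python
import math
from itertools import product

def symmetric_capacity(W):
    """I(W) in bits for uniform inputs; W = [W(.|0), W(.|1)]."""
    total = 0.0
    for p0, p1 in zip(W[0], W[1]):
        avg = 0.5 * (p0 + p1)
        for p in (p0, p1):
            if p > 0.0:
                total += 0.5 * p * math.log2(p / avg)
    return total

def polar_transform(W):
    """Return (W', W'') for one combining step, with f(y1, y2) = (y1, y2).

    W'(y1, y2 | u1)      = sum_{u2} (1/2) W(y1 | u1 xor u2) W(y2 | u2)
    W''(y1, y2, u1 | u2) =          (1/2) W(y1 | u1 xor u2) W(y2 | u2)
    """
    n = len(W[0])
    w_minus = [[0.0] * (n * n) for _ in range(2)]      # W'  : output (y1, y2)
    w_plus = [[0.0] * (2 * n * n) for _ in range(2)]   # W'' : output (y1, y2, u1)
    for y1, y2, u1, u2 in product(range(n), range(n), (0, 1), (0, 1)):
        p = 0.5 * W[u1 ^ u2][y1] * W[u2][y2]
        w_minus[u1][y1 * n + y2] += p                  # accumulates the sum over u2
        w_plus[u2][(y1 * n + y2) * 2 + u1] = p
    return w_minus, w_plus

bsc = [[0.89, 0.11], [0.11, 0.89]]  # binary symmetric channel, crossover 0.11
w_minus, w_plus = polar_transform(bsc)
i_w = symmetric_capacity(bsc)
i_minus = symmetric_capacity(w_minus)
i_plus = symmetric_capacity(w_plus)
print(i_w, i_minus, i_plus)
```

For this channel $0 < I(W) < 1$, so the ordering $I(W') < I(W) < I(W'')$ should be strict while the sum $I(W') + I(W'')$ stays at $2I(W)$.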
D. Proof of Proposition 5

Proof of (26) is straightforward.

$$\begin{aligned}
Z(W'') &= \sum_{y_1^2, u_1} \sqrt{W''(f(y_1, y_2), u_1 | 0)\, W''(f(y_1, y_2), u_1 | 1)} \\
&= \sum_{y_1^2, u_1} \frac{1}{2} \sqrt{W(y_1 | u_1) W(y_2 | 0)}\, \sqrt{W(y_1 | u_1 \oplus 1) W(y_2 | 1)} \\
&= \sum_{y_2} \sqrt{W(y_2 | 0) W(y_2 | 1)} \cdot \sum_{u_1} \frac{1}{2} \sum_{y_1} \sqrt{W(y_1 | u_1) W(y_1 | u_1 \oplus 1)} \\
&= Z(W)^2.
\end{aligned}$$



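To prove (27), we put for shorthand $\alpha(y_1) = W(y_1|0)$,
$\delta(y_1) = W(y_1|1)$, $\beta(y_2) = W(y_2|0)$, and $\gamma(y_2) = W(y_2|1)$,
and write

$$\begin{aligned}
Z(W') &= \sum_{y_1^2} \sqrt{W'(f(y_1, y_2) | 0)\, W'(f(y_1, y_2) | 1)} \\
&= \sum_{y_1^2} \frac{1}{2} \sqrt{\alpha(y_1)\beta(y_2) + \delta(y_1)\gamma(y_2)}\, \sqrt{\alpha(y_1)\gamma(y_2) + \delta(y_1)\beta(y_2)} \\
&\le \sum_{y_1^2} \frac{1}{2} \left[ \sqrt{\alpha(y_1)\beta(y_2)} + \sqrt{\delta(y_1)\gamma(y_2)} \right] \left[ \sqrt{\alpha(y_1)\gamma(y_2)} + \sqrt{\delta(y_1)\beta(y_2)} \right] \\
&\qquad - \sum_{y_1^2} \sqrt{\alpha(y_1)\beta(y_2)\delta(y_1)\gamma(y_2)}
\end{aligned}$$

where the inequality follows from the identity

$$\left[ \sqrt{(\alpha\beta + \delta\gamma)(\alpha\gamma + \delta\beta)} \right]^2 + 2\sqrt{\alpha\beta\delta\gamma}\, (\sqrt{\alpha} - \sqrt{\delta})^2 (\sqrt{\beta} - \sqrt{\gamma})^2 = \left[ (\sqrt{\alpha\beta} + \sqrt{\delta\gamma})(\sqrt{\alpha\gamma} + \sqrt{\delta\beta}) - 2\sqrt{\alpha\beta\delta\gamma} \right]^2.$$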
Next, we note that

$$\sum_{y_1^2} \alpha(y_1) \sqrt{\beta(y_2)\gamma(y_2)} = Z(W).$$

Likewise, each term obtained by expanding
$(\sqrt{\alpha(y_1)\beta(y_2)} + \sqrt{\delta(y_1)\gamma(y_2)})(\sqrt{\alpha(y_1)\gamma(y_2)} + \sqrt{\delta(y_1)\beta(y_2)})$ gives $Z(W)$
when summed over $y_1^2$. Also, $\sqrt{\alpha(y_1)\beta(y_2)\delta(y_1)\gamma(y_2)}$
summed over $y_1^2$ equals $Z(W)^2$. Combining these, we obtain
the claim (27). Equality holds in (27) iff, for any choice of
$y_1^2$, one of the following is true: $\alpha(y_1)\beta(y_2)\gamma(y_2)\delta(y_1) = 0$
or $\alpha(y_1) = \delta(y_1)$ or $\beta(y_2) = \gamma(y_2)$. This is satisfied if $W$
is a BEC. Conversely, if we take $y_1 = y_2$, we see that for
equality in (27), we must have, for any choice of $y_1$, either
$\alpha(y_1)\delta(y_1) = 0$ or $\alpha(y_1) = \delta(y_1)$; this is equivalent to saying
that $W$ is a BEC.
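The algebraic identity behind the inequality can be checked numerically, which is a useful guard when transcribing it. The sketch below assumes the identity in the form $(\alpha\beta + \delta\gamma)(\alpha\gamma + \delta\beta) + 2\sqrt{\alpha\beta\delta\gamma}(\sqrt{\alpha} - \sqrt{\delta})^2(\sqrt{\beta} - \sqrt{\gamma})^2 = [(\sqrt{\alpha\beta} + \sqrt{\delta\gamma})(\sqrt{\alpha\gamma} + \sqrt{\delta\beta}) - 2\sqrt{\alpha\beta\delta\gamma}]^2$. The function name and the random sampling are illustrative choices.

```python
import math
import random

def identity_sides(alpha, beta, gamma, delta):
    """Return (lhs, rhs) of the identity used for the inequality step."""
    sa, sb, sg, sd = (math.sqrt(v) for v in (alpha, beta, gamma, delta))
    cross = sa * sb * sg * sd  # sqrt(alpha * beta * gamma * delta)
    lhs = ((alpha * beta + delta * gamma) * (alpha * gamma + delta * beta)
           + 2.0 * cross * (sa - sd) ** 2 * (sb - sg) ** 2)
    rhs = ((sa * sb + sd * sg) * (sa * sg + sd * sb) - 2.0 * cross) ** 2
    return lhs, rhs

rng = random.Random(1)
max_gap = 0.0      # largest |lhs - rhs| seen
violations = 0     # cases where the resulting inequality fails
for _ in range(10000):
    a, b, g, d = (rng.random() for _ in range(4))
    lhs, rhs = identity_sides(a, b, g, d)
    max_gap = max(max_gap, abs(lhs - rhs))
    # Inequality implied by the identity (the added term is non-negative):
    # sqrt((ab + dg)(ag + db)) <= sqrt(rhs)
    if math.sqrt((a * b + d * g) * (a * g + d * b)) > math.sqrt(rhs) + 1e-12:
        violations += 1
print(max_gap, violations)
```

Because the added term $2\sqrt{\alpha\beta\delta\gamma}(\sqrt{\alpha} - \sqrt{\delta})^2(\sqrt{\beta} - \sqrt{\gamma})^2$ is non-negative, dropping it yields the inequality, with equality exactly when that term vanishes.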



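To prove (28), we need the following result which states
that the parameter $Z(W)$ is a concave function of the channel
transition probabilities.

Lemma 4: Given any collection of B-DMCs $W_j : \mathcal{X} \to \mathcal{Y}$,
$j \in \mathcal{J}$, and a probability distribution $Q$ on $\mathcal{J}$, define $W : \mathcal{X} \to \mathcal{Y}$
as the channel $W(y|x) = \sum_{j \in \mathcal{J}} Q(j) W_j(y|x)$. Then,

$$\sum_{j \in \mathcal{J}} Q(j) Z(W_j) \le Z(W). \tag{88}$$

Proof: This follows by first rewriting $Z(W)$ in a different
form and then applying Minkowski's inequality [10, p. 524,
ineq. (h)].

$$\begin{aligned}
Z(W) &= \sum_{y} \sqrt{W(y|0) W(y|1)} \\
&= -1 + \frac{1}{2} \sum_{y} \left[ \sum_{x} \sqrt{W(y|x)} \right]^2 \\
&\ge -1 + \frac{1}{2} \sum_{y} \sum_{j \in \mathcal{J}} Q(j) \left[ \sum_{x} \sqrt{W_j(y|x)} \right]^2 \\
&= \sum_{j \in \mathcal{J}} Q(j) Z(W_j).
\end{aligned}$$

■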
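Inequality (88) says that mixing channels cannot decrease the Bhattacharyya parameter below the average of the components. A minimal numerical sketch follows. The helpers `random_bdmc` and `mixture` are hypothetical names introduced here, and the channels are random draws rather than anything from the paper.

```python
import math
import random

def bhattacharyya(W):
    """Z(W) = sum_y sqrt(W(y|0) * W(y|1)); W = [W(.|0), W(.|1)]."""
    return sum(math.sqrt(p * q) for p, q in zip(W[0], W[1]))

def random_bdmc(n_outputs, rng):
    """Hypothetical helper: a random binary-input channel with n_outputs outputs."""
    rows = []
    for _ in range(2):
        weights = [rng.random() for _ in range(n_outputs)]
        total = sum(weights)
        rows.append([w / total for w in weights])
    return rows

def mixture(channels, weights):
    """W(y|x) = sum_j Q(j) * W_j(y|x) for channels on a common output alphabet."""
    n_outputs = len(channels[0][0])
    return [[sum(q * Wj[x][y] for q, Wj in zip(weights, channels))
             for y in range(n_outputs)] for x in range(2)]

rng = random.Random(2)
smallest_gap = float("inf")   # min over trials of Z(W) - sum_j Q(j) Z(W_j)
for _ in range(500):
    channels = [random_bdmc(4, rng) for _ in range(3)]
    raw = [rng.random() for _ in range(3)]
    weights = [r / sum(raw) for r in raw]
    gap = bhattacharyya(mixture(channels, weights)) - sum(
        q * bhattacharyya(Wj) for q, Wj in zip(weights, channels))
    smallest_gap = min(smallest_gap, gap)
print(smallest_gap)
```

A non-negative `smallest_gap` is consistent with (88). It is evidence on sampled channels only, not a substitute for the Minkowski argument.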
We now write $W'$ as the mixture

$$W'(f(y_1, y_2) | u_1) = \frac{1}{2} \left[ W_0(y_1^2 | u_1) + W_1(y_1^2 | u_1) \right]$$

where

$$W_0(y_1^2 | u_1) = W(y_1 | u_1) W(y_2 | 0),$$
$$W_1(y_1^2 | u_1) = W(y_1 | u_1 \oplus 1) W(y_2 | 1),$$

and apply Lemma 4 to obtain the claimed inequality

$$Z(W') \ge \frac{1}{2} \left[ Z(W_0) + Z(W_1) \right] = Z(W).$$

Since $0 \le Z(W) \le 1$ and $Z(W'') = Z(W)^2$, we have
$Z(W) \ge Z(W'')$, with equality iff $Z(W)$ equals 0 or 1. Since
$Z(W') \ge Z(W)$, this also shows that $Z(W') = Z(W'')$ iff
$Z(W)$ equals 0 or 1. So, by Prop. 1, $Z(W') = Z(W'')$ iff
$I(W)$ equals 1 or 0.
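The three relations among $Z(W)$, $Z(W')$, and $Z(W'')$ can be checked on concrete channels. The sketch below assumes $f(y_1, y_2) = (y_1, y_2)$ and a two-row list representation of a channel. The example channel `asym` is an arbitrary asymmetric choice made for illustration.

```python
import math
from itertools import product

def bhattacharyya(W):
    """Z(W) = sum_y sqrt(W(y|0) * W(y|1)); W = [W(.|0), W(.|1)]."""
    return sum(math.sqrt(p * q) for p, q in zip(W[0], W[1]))

def polar_transform(W):
    """Return (W', W'') for one combining step, with f(y1, y2) = (y1, y2)."""
    n = len(W[0])
    w_minus = [[0.0] * (n * n) for _ in range(2)]      # W'  : output (y1, y2)
    w_plus = [[0.0] * (2 * n * n) for _ in range(2)]   # W'' : output (y1, y2, u1)
    for y1, y2, u1, u2 in product(range(n), range(n), (0, 1), (0, 1)):
        p = 0.5 * W[u1 ^ u2][y1] * W[u2][y2]
        w_minus[u1][y1 * n + y2] += p                  # accumulates the sum over u2
        w_plus[u2][(y1 * n + y2) * 2 + u1] = p
    return w_minus, w_plus

asym = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]  # illustrative asymmetric B-DMC
w_minus, w_plus = polar_transform(asym)
z = bhattacharyya(asym)
z_minus = bhattacharyya(w_minus)
z_plus = bhattacharyya(w_plus)

print("Z(W'') - Z(W)^2         :", z_plus - z * z)            # expected ~0
print("Z(W')  - Z(W)           :", z_minus - z)               # expected >= 0
print("2Z(W) - Z(W)^2 - Z(W')  :", 2 * z - z * z - z_minus)   # expected >= 0
```

Since `asym` is not a BEC, the upper bound $Z(W') \le 2Z(W) - Z(W)^2$ is expected to hold with strict inequality here.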

E. Proof of Proposition 6

From (17), we have the identities

$$W'(f(y_1, y_2) | 0)\, W'(f(y_1, y_2) | 1) = \frac{1}{4} \left[ W(y_1|0)^2 + W(y_1|1)^2 \right] W(y_2|0) W(y_2|1) + \frac{1}{4} \left[ W(y_2|0)^2 + W(y_2|1)^2 \right] W(y_1|0) W(y_1|1), \tag{89}$$

$$W'(f(y_1, y_2) | 0) - W'(f(y_1, y_2) | 1) = \frac{1}{2} \left[ W(y_1|0) - W(y_1|1) \right] \left[ W(y_2|0) - W(y_2|1) \right]. \tag{90}$$

Suppose $W$ is a BEC, but $W'$ is not. Then, there exists $(y_1, y_2)$
such that the left sides of (89) and (90) are both different from
zero. From (90), we infer that neither $y_1$ nor $y_2$ is an erasure
symbol for $W$. But then the RHS of (89) must be zero, which
is a contradiction. Thus, $W'$ must be a BEC. From (90), we
conclude that $f(y_1, y_2)$ is an erasure symbol for $W'$ iff either
$y_1$ or $y_2$ is an erasure symbol for $W$. This shows that the
erasure probability for $W'$ is $2\epsilon - \epsilon^2$, where $\epsilon$ is the erasure
probability of $W$.

Conversely, suppose $W'$ is a BEC but $W$ is not. Then, there
exists $y_1$ such that $W(y_1|0) W(y_1|1) > 0$ and $W(y_1|0) - W(y_1|1) \ne 0$.
By taking $y_2 = y_1$, we see that the RHSs of
(89) and (90) can both be made non-zero, which contradicts
the assumption that $W'$ is a BEC.

The other claims follow from the identities 






$$W''(f(y_1, y_2), u_1 | 0)\, W''(f(y_1, y_2), u_1 | 1) = \frac{1}{4} W(y_1 | u_1) W(y_1 | u_1 \oplus 1) W(y_2 | 0) W(y_2 | 1),$$

$$W''(f(y_1, y_2), u_1 | 0) - W''(f(y_1, y_2), u_1 | 1) = \frac{1}{2} \left[ W(y_1 | u_1) W(y_2 | 0) - W(y_1 | u_1 \oplus 1) W(y_2 | 1) \right].$$

The arguments are similar to the ones already given and we
omit the details, other than noting that $(f(y_1, y_2), u_1)$ is an
erasure symbol for $W''$ iff both $y_1$ and $y_2$ are erasure symbols
for $W$.
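The erasure-symbol characterization can be illustrated by building $W'$ and $W''$ from a BEC and testing each output symbol directly. This is a sketch under assumptions: $f$ is the identity map, a symbol $y$ counts as an erasure when $W(y|0) = W(y|1) > 0$, and the helper `erasure_probability` with its tolerance `tol` is introduced here for illustration.

```python
from itertools import product

def polar_transform(W):
    """Return (W', W'') for one combining step, with f(y1, y2) = (y1, y2)."""
    n = len(W[0])
    w_minus = [[0.0] * (n * n) for _ in range(2)]      # W'  : output (y1, y2)
    w_plus = [[0.0] * (2 * n * n) for _ in range(2)]   # W'' : output (y1, y2, u1)
    for y1, y2, u1, u2 in product(range(n), range(n), (0, 1), (0, 1)):
        p = 0.5 * W[u1 ^ u2][y1] * W[u2][y2]
        w_minus[u1][y1 * n + y2] += p                  # accumulates the sum over u2
        w_plus[u2][(y1 * n + y2) * 2 + u1] = p
    return w_minus, w_plus

def erasure_probability(W, tol=1e-12):
    """Return the erasure probability if W is a BEC, else None.

    BEC test: every output y has W(y|0) * W(y|1) = 0 or W(y|0) = W(y|1).
    """
    eps = 0.0
    for p0, p1 in zip(W[0], W[1]):
        if p0 > tol and p1 > tol:          # y is reachable from both inputs
            if abs(p0 - p1) > tol:
                return None                # y is neither clean nor an erasure
            eps += p0                      # y is an erasure symbol
    return eps

eps = 0.4
bec = [[1 - eps, eps, 0.0], [0.0, eps, 1 - eps]]   # outputs: 0, erasure, 1
bec_minus, bec_plus = polar_transform(bec)
eps_minus = erasure_probability(bec_minus)   # compare with 2*eps - eps**2
eps_plus = erasure_probability(bec_plus)     # compare with eps**2

bsc = [[0.89, 0.11], [0.11, 0.89]]
bsc_minus_is_bec = erasure_probability(polar_transform(bsc)[0]) is not None
print(eps_minus, eps_plus, bsc_minus_is_bec)
```

The binary symmetric channel is included as a contrast case: it is not a BEC, and by the converse part of the proof neither is its $W'$.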



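F. Proof of Lemma 1

The proof follows that of a similar result from Chung
[9, Theorem 4.1.1]. Fix $\zeta > 0$. Let $\Omega_0 \triangleq \{\omega \in \Omega : \lim_{n \to \infty} Z_n(\omega) = 0\}$.
By Prop. 10, $P(\Omega_0) = I_0$. Fix $\omega \in \Omega_0$.
The convergence $Z_n(\omega) \to 0$ implies that there exists $n_0(\omega, \zeta)$ such that
$n \ge n_0(\omega, \zeta) \Rightarrow Z_n(\omega) \le \zeta$. Thus, $\omega \in \mathcal{T}_m(\zeta)$ for some
$m$. So, $\Omega_0 \subset \bigcup_{m=1}^{\infty} \mathcal{T}_m(\zeta)$. Therefore, $P\left(\bigcup_{m=1}^{\infty} \mathcal{T}_m(\zeta)\right) \ge P(\Omega_0)$.
Since $\mathcal{T}_m(\zeta) \uparrow \bigcup_{m=1}^{\infty} \mathcal{T}_m(\zeta)$, by the monotone
convergence property of a measure, $\lim_{m \to \infty} P[\mathcal{T}_m(\zeta)] = P\left[\bigcup_{m=1}^{\infty} \mathcal{T}_m(\zeta)\right]$.
So, $\lim_{m \to \infty} P[\mathcal{T}_m(\zeta)] \ge I_0$. It follows
that, for any $\zeta > 0$, $\delta > 0$, there exists a finite $m_0 = m_0(\zeta, \delta)$
such that, for all $m \ge m_0$, $P[\mathcal{T}_m(\zeta)] \ge I_0 - \delta/2$. This
completes the proof.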